1   INTRODUCTION

The  brief  history  of  computer  design  has  been  one  of  constant  and  rapid 
change.  This  change  has  centered  in  the  technology  of  circuit  design.  Logic 
has grown cheaper, smaller, and faster at dramatic rates. With few exceptions,
the art of computer design has been largely one of doing more and
more  of  the  same  thing  faster  and  faster.  There  have,  of  course,  been  many 
design  innovations.  However,  one  could  undoubtedly  explain  the  operation  of 
any  modern  computer  to  Babbage's  ghost  in  a  day  or  two.  The  nature  of  the 
game  of  computer  design  is  changing  radically. 

The cost and size of logic continue to decline at an accelerating rate.
The speed of logic is nearing theoretical limits, with the speed of electrical
signals being a major design consideration in all current high-speed
computers. Thus, barring some extremely dramatic revolution in physics,
logic is not going to get much faster. On the other hand, if one wants to
obtain a perspective on the limits of cost and gate density for logic, one
might compute the cost per gate and gate density of a pigeon brain. We have
only the crudest ideas on how to effectively utilize the logic technology
that currently exists. With each new advance in circuit technology, the
depth of this ignorance increases. This thesis is one primitive attempt to
begin to plumb these depths.

The  problems  associated  with  effectively  utilizing  this  technology  can 
be  divided  into  two  broad  categories.  The  first  is  to  determine  what  useful 
structures one can concoct using a very large number of gates. The second
is  to  determine  which  of  these  structures  can  be  practically  implemented 
given  the  constraints  of  an  existing  technology  and  how  this  may  be  done. 


This thesis concentrates on the first of these problems and treats the
second only in the most general sense. We justify this approach because
simultaneously solving both problems is extraordinarily complex and
time-consuming. Determining useful theoretical designs must inevitably be the
first step in generating practical designs. Further, many of the practical
constraints are in a rapid state of change. The practical constraints that
we do take into account are ones that we expect to exist in any technology.
These include restrictions on fan-in and fan-out, and on logic levels per clock.
In addition, we have adopted a general structural approach of attempting to
find basic building blocks at a fairly high level of complexity. We discuss
this approach and its relevance to IC technology in Chapter 3.

There are two broad approaches to determining how to utilize this
technology effectively. One is to consider existing hardware and software
techniques and see how these may be expanded and generalized.
The alternative is to start from scratch. Existing programming hardware and
software techniques have evolved from the notion of a simple arithmetic unit,
a set of memory registers, and a single control unit. This structure is a
natural one for humans to understand and use. It is unlikely to be of
universal significance for all data processing problems. It is, in fact, our
belief that the problem of investigating useful computing structures is
coextensive with the problem of investigating useful mathematical structures.
We consider this open-ended approach to the problem to be of deep
intellectual fascination. In Section 3.3, we discuss some observations about this
approach. However, for pragmatic reasons, the bulk of this thesis takes the
former approach of extending existing techniques. We will now define our
approach and objectives in more detail.


Our overall goal is to design a good general-purpose, expandable, fast
parallel computer. This statement of objective is both very vague and
internally inconsistent. Parallelism implies a structuring of multiple
computing elements. Inevitably some algorithms will fit the structure better
than others. More parallelism implies less generality. However, we do know
that eight arithmetic units can be effectively utilized by most FORTRAN
programs [Kuck 5]. Thus, a limited degree of parallelism is not incompatible
with a fairly general-purpose computer. The vagueness in our statement of
overall objectives is intentional. Good computer design involves many
complex and ill-defined factors. A precise statement of objectives is
impossible. We can enumerate the important factors, and we do so by first
dividing them into the two broad categories of programmability and hardware.

A machine with good programmability should allow for the easy
implementation of a wide range of languages. These should be easily expandable both
to facilitate the evolution of special-purpose languages, and to parallel
hardware expandability. The languages should be efficient both in terms of
the code they produce, and their own execution time. Programs should be
easy to debug. The machine design should allow for the easy implementation
of a powerful, efficient operating system. Facilities should be provided
which ease the burden of managing a hierarchy of memories, and in many
instances eliminate it entirely. Multiprogramming and multiprocessing features
should be provided to facilitate fast turnaround on short jobs and to allow
the system to efficiently handle a broad range of problems.

The  hardware  should  be  modular,  reliable,  expandable,  cheap,  and  easy 
and  inexpensive  to  maintain. 

These goals are still quite vague, and it is essential that they remain
so. Computer design is more art than science, and to pretend otherwise can
lead to disaster. The objectives of computer design cannot be precisely
defined without gross oversimplification.

I  should  say  a  few  words  about  the  results  of  this  research.  It  does 
not  consist  of  any  one  technique  or  conclusion.  Rather,  it  consists  of 
showing how a number of techniques may be developed and integrated to
produce a good machine.

Although our overall objectives must remain vague, we can be more
specific about the techniques we will employ in meeting these objectives.
Hardware designed for specialized purposes is potentially much more
efficient than that designed for more general purposes. Our goal is to design
a general-purpose computer. There are two ways in which specialized
hardware can be employed in such a machine. First, we can directly implement
some operating system and compiler functions. Secondly, we can allow for
the inclusion of specialized but unspecified operational units. The great
danger in designing specialized hardware is that it becomes extremely
efficient at doing what nobody cares to have done. There do exist universal
compiler and operating system functions that can be completely specified at
machine design time, and can thus benefit from specialized hardware.
Operating system functions include interrupt processing, multiprogramming
allocation of resources, and memory management. Obvious compiler functions for
which some existing machines have specialized hardware include array
indexing and subroutine calls. Functions which are candidates for such hardware
include mapping the parallel structure of a language onto the parallel
structure of a machine, hardware management of loops to minimize execution
time non-determinism, and anticipatory I/O scheduling.


There can be no way to anticipate what sort of specialized hardware
may be desirable or even necessary for various applications. It is feasible
to design a general-purpose computer that will allow for the later inclusion
of various specialized pieces of hardware. To allow for this, and because
it is a good basic technique, our overall design philosophy will involve the
construction of various functional units. The interfaces between these units
will be as simple and at as high a level of abstraction as seems practical.
As an example, we will have units to perform unspecified operations on
vectors of a fixed size and similar units which operate on scalars. The
interfaces between these units and the rest of the machine will be limited to the
operands and results and a minimum of information specifying whether or not
the unit is in a position to accept operands and the exact operation to
perform. This should allow for the evolution of specialized hardware.
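
The narrow interface described above can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the names `FunctionalUnit`, `can_accept`, `issue`, `VectorAdder`, and `dispatch` are hypothetical, the thesis defines no software interface, and the fixed width of 8 is assumed here for concreteness.

```python
from typing import Protocol, Sequence

WIDTH = 8  # assumed fixed vector size for this sketch


class FunctionalUnit(Protocol):
    """Hypothetical interface: operands, results, a ready flag, an operation."""

    def can_accept(self) -> bool: ...

    def issue(self, op: str, a: Sequence[float], b: Sequence[float]) -> list[float]: ...


class VectorAdder:
    """One possible plug-in unit; callers see only the interface above."""

    def __init__(self) -> None:
        self._busy = False  # this toy unit is always ready

    def can_accept(self) -> bool:
        return not self._busy

    def issue(self, op, a, b):
        if op != "ADD":
            raise ValueError(f"unsupported operation: {op}")
        if len(a) != WIDTH or len(b) != WIDTH:
            raise ValueError("operands must be fixed-width vectors")
        return [x + y for x, y in zip(a, b)]


def dispatch(units, op, a, b):
    """Send the operation to the first unit that reports it can accept operands."""
    for unit in units:
        if unit.can_accept():
            return unit.issue(op, a, b)
    raise RuntimeError("no unit available")
```

Because `dispatch` relies only on the two interface methods, a different unit (for example a specialized routing unit) could be substituted without changing the caller, which is the property the text is after.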

As the complexity of any design project increases, it becomes important
to break the problem up into a hierarchy of more tractable problems. Further,
there are advantages to having this structure reflected in physical units.
It would be particularly desirable if the lowest level of structure could be
implemented on LSI chips, making a theoretical design constraint compatible
with practical implementation constraints. One result of this research is
the observation that a few functional units are required in many contexts and
might be ideal candidates for a basic set of chips. Along with the
structured hierarchy of functions, we require simple, well-defined interfaces
between units. This constraint also serves to keep the design problem
tractable and to improve the chances for a smooth LSI adaptation. Another
advantage of such a structure is to ease hardware debugging and maintenance.
In Section 3.1.3.3 we will discuss how this structure could facilitate the
construction of a super-reliable computer. A final advantage of this
structure is that any unit that meets the interface specifications can be
plugged in at any time after the machine is built. This could allow for
some use of new technologies in an existing machine as well as facilitating
the addition of specialized hardware mentioned earlier. These ideas are
simply good engineering practice and similar to those of structured
programming.


2   OVERALL  STRUCTURE 

In  this  section  we  describe  the  overall  structure  that  evolves  from 
the  objectives  and  techniques  we  have  mentioned.  We  will  first  list  the 
basic  structural  features  and  the  objectives  they  are  intended  to  meet  and 
then  go  on  to  describe  these  features  in  more  detail. 

We begin by discussing expandability. The basic "computer" we design
will be called a Computation Node, and we will refer to external
expandability and internal expandability within a node. External expandability refers
to the fact that these nodes will be especially well suited to being hooked
together as a network of computers. We will briefly discuss this subject in
a later chapter. The bulk of this thesis is concerned with the design of a
single Computation Node. Internal expandability refers to the fact that our
modular approach will allow for varying numbers of all the major control,
memory, and computation portions of the machine. This is not fundamentally
different from existing computers which allow for varying numbers of CPUs,
memory modules, I/O channels, etc. Our approach will allow for a significantly
greater flexibility in this area than currently exists.

To allow for the implementation of compiler and operating system
functions in hardware, we will employ three levels of machine languages ranging
from an APL-like, high-level vector language to a language which is basically
a set of queue entries specifying physical machine addresses and types of
operations. These various levels of machine language also help in keeping
the design modular with well-defined interfaces. These languages, in
conjunction with an overall philosophy of having all processing driven by local
queues and control, will ease implementation of multiprogramming and
multiprocessing.


We have already mentioned that we will employ a minimum parallelism
of 8. We will extend the potential parallelism without restricting the
generality. We do this by extending conventional multiprogramming and
multiprocessing to allow these functions at the queued instruction level.
In particular, the largest version of this machine will allow four programs
to be running simultaneously, distributing their vector instructions to up
to six 8-wide arithmetic units. Of course, additional parallelism could be
obtained through the external expandability we have mentioned.

Before going on to a more detailed description of the machine's
structure, some additional comments on our objectives are in order. We are not
attempting to provide maximum potential computation power at minimal cost
or even maximum usable computation power at minimum cost. Instead we are
considering what we believe to be the correct problem: providing the most
cost-effective overall system. Overall system cost includes both the cost
of developing system software and ultimately the cost of doing applications
programming. Thus, much of the structure we will discuss is intended to
make the machine more useful in this general sense.

Figure 1 gives the overall structure of the Computation Node, Figure 2
gives the structure of the Computation Unit within the Computation Node,
and Table 1 lists the units in these figures and briefly describes their
functions. In order to provide a general idea of the operation of these
units, we will provide a brief example. The example will raise more
questions than it answers, but it is only intended to provide an initial
impression of the functional structure we have in mind. Later chapters will
describe the structure and function of these units in more detail.

TABLE 1   MAJOR COMPONENT FUNCTIONS

Functional Unit               Abbreviation  Functions

Instruction Unit Dispatcher   IUD           (1) Map logical registers of OFFL onto the
                                                various physical registers within the
                                                computation unit and by so doing schedule
                                                the various execution units.
                                            (2) Do conflict resolution between the
                                                competing MIDs.

Vector Execution Unit         VEU           (1) Perform the actual processing of all
                                                vector operations of OFFL. These will
                                                include arithmetic, routing, and may
                                                include special-purpose vector operations.

Scalar Execution Unit         SEU           (1) Perform the actual processing of all
                                                scalar operations of OFFL.

Vector Buffer                               (1) Provide temporary storage for vectors.

Scalar Buffer                               (1) Provide storage for scalars.

Vector Switch                               (1) Provide paths for routing vectors between
                                                the various vector execution units,
                                                buffers, primary memory, and the MIDs.

Scalar Switch                               (1) Provide paths for routing scalars between
                                                the scalar execution units and the scalar
                                                buffer.

Computation Unit                            (1) Includes all of the above functional units
                                                and their interconnections and connections
                                                to the external world.

Macro Instruction Decoder     MID           (1) Decomposition of macro instructions into
                                                OFFL.
                                            (2) Program control.
                                            (3) Initiation of page faults.
                                            (4) Anticipatory I/O.

Memory Manager                              (1) Initiates request for page swappings.
                                            (2) Assures that sufficient core is available
                                                to maintain a high level of efficiency.

Memory Controller                           (1) Maps logical memory addresses into
                                                physical addresses.
                                            (2) Does the actual addressing of memory.

Main Memory                                 (1) High-speed, random access storage.

Backup Storage                              (1) All other storage devices in the machine
                                                not mentioned above.

Computation Node                            (1) All of the above functional units, their
                                                interconnections and connections to the
                                                external world.

Our example will consist of a brief APL program segment. A, B, and
C are vectors of length 24; D, F, and G are scalars. The program consists
of the element-by-element addition of A to B and the dot product of the result
with C. This result is stored in F and also added to D, and that result is
stored in G. The program for this is:

F <- +/C×A+B
G <- F+D
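
For readers unfamiliar with APL, the segment can be paraphrased in Python. This is only a reading aid: the function name `run_segment` is hypothetical, and the sketch says nothing about how the machine itself executes the code.

```python
def run_segment(a, b, c, d):
    """Return (F, G) for  F <- +/C x A+B  and  G <- F+D."""
    # APL evaluates right to left: A+B is formed first, then multiplied
    # element by element with C, then summed by the plus-reduction (+/).
    f = sum(ci * (ai + bi) for ai, bi, ci in zip(a, b, c))
    g = f + d
    return f, g
```

With vectors of length 24 as in the text, `run_segment(list(range(24)), [1] * 24, [2] * 24, 5)` sums the 24 products into F and then adds D to obtain G.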

We  will  refer  to  the  highest  level  vector  language  as  Universal  Assembly 
Language  (UAL).  The  name  is  intended  to  reflect  its  machine-independent 
character.  We  will  describe  how  the  above  is  translated  into  UAL  and  how 
it  is  processed  by  our  machine. 

UAL basically consists of 3-address instructions. In order to
minimize the size of these instructions, their operands will refer to a small
group of special registers, which will contain descriptor information for
the actual operands. Special 2-address instructions are used to relate
these registers to program-defined variables. The UAL for the above will
be as follows:


Instruction   Operand           Comment

SETADR        T1 <- A           Register T1 now refers to variable A.
SETADR        T2 <- B
ADD           T1 + T2 -> T1
SETADR        T2 <- C
MULTIPLY      T1 × T2 -> T1
VECSUM        T1 -> T1          T1 now refers to the sum of the components
                                of T1 before this statement.
STORE         T1 -> F
SETADR        T2 <- D           T1 is not used here because its value
                                is needed later.
ADD           T1 + T2 -> T1
STORE         T1 -> G


An MID must translate these UAL instructions into a machine-dependent
Operand Fixed Format Language (OFFL). This name is intended to refer to the
fact that this language explicitly recognizes the vector width of the
machine. The IUD must in turn translate OFFL into a sequence of queue
entries within the Computation Unit. These queue entries will ultimately
result in a logically correct execution of the code. Figure 3 shows the
tree for this program. Table 2 shows what the OFFL instructions might look
like. In constructing this table and figure, we have chosen temporary
register locations to show how they can be used to control the sequencing of the
program. This is the basis of the method by which the hardware permutes the
order of execution of instructions in any way that optimizes resource
utilization without affecting the logical outcome of the code. In particular,
the MID has a small number of logical temporary registers available to it.
These refer to 8-word-wide vectors. It uses these to break up instructions
operating on arbitrarily sized vectors in UAL into OFFL instructions. As soon
as all instructions that use one of these temporaries have been generated,
that temporary may be reused. The IUD assigns physical register locations
to these logical locations. It does this in a way that allows for the

TABLE 2   OFFL PROGRAM

Instruction   Operand               Comment

LOAD          T1 <- A0-7            T refers to vector registers.
LOAD          T2 <- B0-7
VECADD        T1 + T2 -> T1         Vector addition.
LOAD          T2 <- C0-7            There is no reason not to reuse T1 and
                                    T2 in this and the previous statement.
VECMUL        T1 × T2 -> T1         Vector multiplication.
LOAD          T3 <- A8-15           We do not reuse T1 or T2 here so we can
                                    overlap the following sequence of five
                                    statements with the above.
LOAD          T4 <- B8-15
VECADD        T3 + T4 -> T3
LOAD          T4 <- C8-15
VECMUL        T3 × T4 -> T3
VECADD        T1 + T3 -> T1         This add cannot be overlapped so there is
                                    no reason not to reuse T1.
LOAD          T5 <- A16-23
LOAD          T6 <- B16-23
VECADD        T5 + T6 -> T5
LOAD          T6 <- C16-23
VECMUL        T5 × T6 -> T5
VECADD        T1 + T5 -> T1
MOVE          T1 -> TS0-7           Move the result vector into 8 separate
                                    scalars to complete the summation.
SADD          TS0 + TS1 -> TS8      Scalar addition.
SADD          TS2 + TS3 -> TS9
SADD          TS8 + TS9 -> TS8
SADD          TS4 + TS5 -> TS10
SADD          TS6 + TS7 -> TS11
SADD          TS10 + TS11 -> TS10
SADD          TS8 + TS10 -> TS0
STORE         TS0 -> F              At this point all scalar temporaries except
                                    TS0 are available.
LOAD          TS1 <- D
SADD          TS0 + TS1 -> TS0
STORE         TS0 -> G


FIGURE 3   PROGRAM TREE

maximum possible parallelism. In particular, each reuse of a logical
temporary in OFFL will always be assigned a different physical location.
If this were not the case, the store to this temporary would have to wait
until all loads from it requiring its earlier value had completed. Thus,
the assignment of temporaries in our example reflects the way in which the
IUD might assign physical registers. The MID can be more careless about
reassigning logical registers, because the IUD operates as just described.
The above only applies to vector instructions. Scalars are handled in a
different way but with essentially the same result. There can exist several
copies of the same "physical" scalar at the same time. Associative tables
and a time indexing scheme keep them straight. The parallelism in Figure 3
could thus be utilized by our machine.
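
The effect of giving every reuse of a logical temporary a new physical location can be sketched in Python. This is a simplified illustration, not the IUD's actual mechanism: the function `rename`, the tuple instruction format, the `P0, P1, ...` names, and the unbounded pool of physical registers are all assumptions made for the example.

```python
from itertools import count


def rename(instructions):
    """Map logical registers to a fresh physical register on every write.

    Each instruction is (opcode, sources, destination) using logical
    register names. Sources are read through the current mapping; the
    destination always receives a new physical register, so a later write
    never has to wait for earlier reads of the old value.
    """
    fresh = count()   # assumed unbounded supply of physical registers
    mapping = {}      # logical name -> current physical name
    renamed = []
    for opcode, sources, dest in instructions:
        phys_sources = tuple(mapping[s] for s in sources)
        mapping[dest] = f"P{next(fresh)}"
        renamed.append((opcode, phys_sources, mapping[dest]))
    return renamed


program = [
    ("LOAD", (), "T1"),
    ("LOAD", (), "T2"),
    ("VECADD", ("T1", "T2"), "T1"),
    ("LOAD", (), "T2"),            # reuse of logical T2
    ("VECMUL", ("T1", "T2"), "T1"),
]
renamed = rename(program)
```

In the renamed program the second load into logical T2 targets a different physical register from the first, so it does not need to wait for the VECADD that still reads the earlier value.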

3   BASIC  BUILDING  BLOCKS  AND  DESIGN  TECHNIQUES

In the process of designing a family of machines to meet the objectives
discussed previously, we found ourselves using the same sort of functional
units and the same design techniques in many different contexts. Before we
describe this detailed design work, we will provide a generalized discussion
of the basic building blocks and techniques which we have evolved. We regard
this set of very high-level building blocks as significant. The cost of
integrated circuits is much more critically dependent on the number of circuits
ultimately to be produced than on the number of gates in the circuit. Thus,
if one can establish a canonical set of ICs from which a broad class of
computers can be constructed, one will be able to keep the cost of the computers
themselves down. In this thesis we have only undertaken the first step in
the complex process that could ultimately lead to the fabrication of such a
canonical set of ICs. That step is the recognition of the functional
similarity of many of the units we will construct. We have made no attempt to
provide sets of logical designs that will be of universal validity for the
various function types. Such a process would be desirable if one were
planning to construct machines of the sort we have designed. Section 3.1 is a
discussion of these basic building blocks as well as how they can be combined
to form more complex blocks. The resulting structures are in some ways
similar to a well-designed program made up of small subroutines existing at many
levels in an overall hierarchy.

The techniques of pipelining and parallelism are well known and widely
used. They are in a fairly primitive state of development. Using the
building blocks just mentioned, we have applied these techniques in a somewhat
systematic way in the course of doing detailed design. Section 3.2 is a
description of the various ways in which we use pipelining and parallelism
to achieve our objectives. Section 3.3 is a theoretical analysis of
parallelism. Its main purpose is to provide some perspective on the dimensions
of this field and to suggest some unconventional approaches to gaining
greater understanding of this subject.

3.1   BUILDING  BLOCKS

In  this  section  we  first  discuss  the  motivation  for  structuring  the 
machine  as  we  have.  We  then  discuss  the  lowest  level  of  building  blocks 
or  functional  units  from  which  the  entire  machine  is  constructed.  Finally, 
we  discuss  how  more  global  units  are  built  up  from  these  basic  units. 

3.1.1  Motivation

The building blocks we have chosen arise from three aspects of our
overall approach. These are our attempt to provide local distributed
control, hardware implementation of compiler and operating system functions,
and parallelism itself. The building blocks these give rise to are: queues,
controls, switches, access controllers, and descriptive tables. We will
define each of these units and relate them to the three aspects of our approach
in the next section.

3.1.2  Basic  Building  Blocks 

We now describe the basic building blocks. These differ from the
primitives in Bell and Newell's PMS notation. They are not basic components
from which any computer can be constructed. They are fairly complex units.
They also differ from the various boxes that one inevitably draws when
discussing conventional computers. They are more primitive in the sense that
they occur repeatedly in many different contexts in the overall design.

3.1.2.1   Queues 

Conceptually, a queue is a linearly ordered list of requests for
resources. Our basic building block queue arises from the need to provide
local control. By allowing hardware to determine the sequence of
instruction execution, we can allow for more efficient utilization of resources.
To provide for this, our queues attempt to provide first-in, first-out
service. They also attempt to keep the unit they drive as active as
possible. The algorithm to meet these objectives will be to examine the oldest
entry in the queue first and then successively newer entries until one is
found for which all required resources are available. Such FIRFO queues will
be used to drive the vector and scalar execution units and the vector switch.
The control of the queue itself, the testing of outside resources, and the
actual decisions about what action to take will involve other components.
The queue is a memory that allows for the access of entries at the data
rates and in the sequence required for implementing the above functions.
It also allows new queue entries to be made at the data rate required.
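
The selection rule described for these queues can be sketched in Python as a scan from the oldest entry toward newer ones, issuing the first entry whose resources are all free. The class `FirfoQueue`, its method names, and the representation of resources as sets of strings are hypothetical; in the design itself the scanning and testing are done by separate control hardware, not by the queue.

```python
from collections import deque


class FirfoQueue:
    """Oldest-first queue that skips entries whose resources are busy."""

    def __init__(self):
        self._entries = deque()  # leftmost entry is the oldest

    def push(self, name, needs):
        """Append a request together with the set of resources it needs."""
        self._entries.append((name, frozenset(needs)))

    def pop_ready(self, available):
        """Remove and return the oldest entry whose needs are all available.

        `available` is a set of resource names that are currently free.
        Returns None when no entry can be issued.
        """
        for i, (name, needs) in enumerate(self._entries):
            if needs <= available:
                del self._entries[i]
                return name
        return None


q = FirfoQueue()
q.push("vecadd", {"T1", "T2"})
q.push("load", {"mem"})
first = q.pop_ready({"mem"})         # the older entry is blocked, so "load" issues
second = q.pop_ready({"mem"})        # nothing left that can issue
third = q.pop_ready({"T1", "T2"})    # now the older entry can go
```

When every entry is ready, the scan degenerates to plain first-in, first-out service; the out-of-order case arises only when the oldest entry is blocked.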

3.1.2.2   Controls 

A control is a unit capable of sensing various states of its
environment and initiating actions based on this information. The most general
sort of control would be a full-scale Turing Machine with I/O. Our
controls will not correspond to the standard definition in that memory of
previous states will not reside in the control itself, but will always be
in queues and status tables which the control can interrogate. Controls
in general will need to respond very quickly, often within a few gate
delays. Thus, controls will consist of some combinatorial logic with
sequencing circuits driving the unit. They can range from very simple
to rather complex combinatorial circuits driven by the machine clock. They
will be used throughout the machine, driving switches, using queues to
sequence various units, determining the operation of the IUD pipe at all
stages within it, and in general keeping the machine operating by
individually keeping each of its parts going.
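
One way to picture a control that keeps no memory of previous states is as a pure function of what it can currently observe. The Python sketch below is an analogy only; the function `control_step`, its arguments, and the action names are hypothetical and are not taken from the design.

```python
def control_step(queue_head, unit_status):
    """Decide one action from the currently visible state only.

    queue_head:  (operation, unit_name) for the entry under consideration,
                 or None if the queue is empty.
    unit_status: dict mapping unit name -> "idle" or "busy", standing in
                 for a status table the control can interrogate.
    """
    if queue_head is None:
        return ("wait", None)
    operation, unit = queue_head
    if unit_status.get(unit) == "idle":
        return ("start", operation)
    return ("hold", operation)
```

Because the function stores nothing between calls, all history lives in its inputs, which mirrors the statement that state resides in queues and status tables.
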

3.1.2.3  Switches

The process of transferring data and instructions to the units
requiring them will frequently involve the use of small crossbar switches. In
general we will confine the use of the word switches to refer to just such
units used for such purposes. We will use the term routers to refer to units
which sort data under program control.

3.1.2.4  Access  Controllers 

Many  of  the  resources  of  this  machine  can  service  different  controls. 
Access  controllers  referee  this  competition.  They  may  provide  some  priority 
scheduling  scheme  and  usually  have  some  memory  of  the  previous  allocations 
of  the  resource  they  referee.  This  memory  allows  them  to  ensure  that  no 
requesting  unit  can  be  totally  locked  out. 
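
A rotating-priority arbiter is one simple scheme with the property described above: it remembers the previous grant and uses it to guarantee that no requester is locked out. The text does not say which scheme the access controllers use, so the Python sketch below, including the class name `RoundRobinArbiter`, is an assumed example rather than the design's own method.

```python
class RoundRobinArbiter:
    """Grant one requester per cycle, rotating priority after each grant."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = n_requesters - 1  # memory of the previous allocation

    def grant(self, requests):
        """Return the index granted this cycle, or None if nobody asked.

        `requests` is a set of requester indices. The search starts just
        after the most recently granted requester, so a unit that keeps
        requesting is served within n grants.
        """
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if candidate in requests:
                self.last = candidate
                return candidate
        return None


arb = RoundRobinArbiter(3)
grants = [arb.grant({0, 1, 2}) for _ in range(4)]
```

With all three requesters asking continuously, the grants cycle through them in turn, so none can be starved.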

3.1.2.5  Descriptive  Tables 

A descriptive table is simply a memory that contains information about
the current state of the machine and about programs that are executing. One
traditional use of such tables that will occur frequently in our design will
be tables which provide information about the status of registers or buffer
memories. The difference between tables and queues is the method of access.
Tables may be either associative or addressable memories. They will
sometimes be accessible by either method. In addition, the high data rates
required in some instances may necessitate the ability for multiple
simultaneous access.
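
The two access methods can be contrasted in a short Python sketch of a table that supports both. The class `DescriptiveTable` and its row format are hypothetical; a hardware associative memory would compare all rows at once, whereas this software stand-in scans them in a loop.

```python
class DescriptiveTable:
    """Small status table readable by slot number or by matching a key."""

    def __init__(self, size):
        self._rows = [None] * size  # each row is (key, status) or None

    def write(self, slot, key, status):
        self._rows[slot] = (key, status)

    def read(self, slot):
        """Addressable access: look up a row by its position."""
        return self._rows[slot]

    def match(self, key):
        """Associative access: return every slot whose key matches."""
        return [i for i, row in enumerate(self._rows) if row and row[0] == key]


table = DescriptiveTable(4)
table.write(0, "T1", "busy")
table.write(1, "T2", "busy")
table.write(2, "T1", "free")
```

Here `table.read(1)` retrieves a row by address, while `table.match("T1")` finds all rows describing the same register without knowing where they are stored.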

3.1.2.6  Traditional  Components 

All  of  the  components  we  have  discussed  arise  from  existing  concepts. 
Our  definitions  have  been  restricted  in  ways  to  suit  our  purposes. 
Two additional components we will use in a more or less completely
traditional manner are memories and computation units. Memories in addition to
the  specialized  ones  we  have  discussed  will  exist  in  many  sizes  and  types 
throughout  the  machine.  Memories  are  unlike  descriptive  tables  in  that 
they  contain  data  or  program  instructions.  They  will  not  be  associative 
memories,  but  will  allow  for  access  in  various  ways  to  suit  their  purpose. 

Computation  units  will  be  arithmetic  units  and  in  general  the  hardware 
that  actually  does  some  useful  work.  They  will  be  both  vector  and  scalar 
units  and  will  include  routers  whose  purpose  is  to  sort  data.  We  will  not 
propose  any  new  designs  for  these  units.  Our  only  innovation  in  this  area 
is  the  provision  for  plugging  such  units  into  an  existing  machine  without 
significant  hardware  or  software  modifications. 

3.1.3   Block  Interfaces 

In  this  section  we  discuss  the  timing  aspects  of  block  interfaces.  The 
structure  of  the  basic  blocks  just  discussed  as  well  as  the  structure  at  more 
complex  organizational  levels  determine  the  detailed  nature  of  the  interfaces. 
In  general,  the  more  complex  the  units,  the  greater  the  communication  delay; 
this  is  both  necessary  and  tolerable.  We  will  describe  the  various  timing 
levels  we  use  and  the  techniques  for  facilitating  the  timing  structure.  We 
will  describe  how  this  relates  to  pipelining  and  parallelism.  Finally,  we 
will  describe  some  special  advantages  of  the  timing  structure. 

3.1.3.1   Timing  Structure 

We will employ a minor clock and a major clock. The period of the minor
clock will be roughly that corresponding to 8 levels of logic. The major
clock will be 8 minor clocks in duration. The minor clock will be used
within major structural units; the major clock will be used for transfers
between major structural units. The levels of logic for the minor clock
evolved naturally out of the detailed design work. The ratio of major to
minor clocks is directly related to the vector width of the machine. Most
data transfers both within and between units will be pipelined at the rate
of one word per minor clock. This two-level clock structure seems a natural
way to ameliorate some of the problems associated with very fast clocks. We
do not need to specify a physical time for the clock period. This would be
a function of the physical size and speed of the logic used. With current
technology, we could assume 2 ns gate delays. With 8 levels of logic, we
have up to 16 gate delays or roughly a 50 ns clock. This estimate may be a
bit optimistic since one would not want to implement this design with the
fastest logic possible. Very fast logic implies high power dissipation and
comparatively small gate densities. Both of these constraints would be
extremely troublesome in a machine with very high gate counts. On the other
hand, faster logic with less power consumption does seem to be in the offing.
The one constraint on machine speed that is not likely to change is the delay
times for signal transmission, i.e., the speed of light. Our two-level clock
will be especially useful in accommodating this fact of life. Our basic
approach is not geared to developing optimal techniques for a current
technology, but rather for developing techniques that will become
increasingly attractive in the near future given the direction that
technology is moving.

There are structural problems inherent in our two clock levels, and we
will now describe how we deal with these. We can think of the clocks as
being centrally located, synchronized, and broadcasting pulses to all parts
of the machine. Major components must be constructed to operate internally
at the fast clock rate. The interfaces between these must operate at the
slower clock rate, and special interfacing logic will be designed so that
the major units can internally behave as if there were a single fast clock.
We will not specify absolutely what constitutes a major component since this
is a technology dependent decision. We will indicate in the process of
doing detailed logical design the various levels at which the machine can
and cannot be partitioned.
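
The clock figures quoted above can be tied together with a few lines of
arithmetic. The sketch below uses only the numbers given in the text (2 ns
gates, up to 16 gate delays per minor clock, 8 minor clocks per major clock,
and the rounded 50 ns minor clock); the function names are illustrative. The
margin between the raw gate delay and the 50 ns estimate is not itemized in
the text, so the sketch does not try to explain it.

```python
# Figures taken from the text; the names are illustrative only.
GATE_DELAY_NS = 2            # "we could assume 2 ns gate delays"
GATE_DELAYS_PER_MINOR = 16   # "up to 16 gate delays" for 8 levels of logic
MINOR_PER_MAJOR = 8          # the major clock is 8 minor clocks long
MINOR_CLOCK_NS = 50          # the text's rounded estimate of the minor clock


def raw_logic_delay_ns():
    """Pure gate delay within one minor clock, with no other overhead."""
    return GATE_DELAY_NS * GATE_DELAYS_PER_MINOR


def major_clock_ns(minor_ns=MINOR_CLOCK_NS):
    """Length of the major clock for a given minor clock period."""
    return minor_ns * MINOR_PER_MAJOR


def words_per_second(minor_ns=MINOR_CLOCK_NS):
    """Rate of a path pipelined at one word per minor clock."""
    return 1e9 / minor_ns
```

With the 50 ns estimate, a major clock is 400 ns and a single pipelined path
carries 20 million words per second.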

Major components will be transmitting both data and control information
among themselves at the major clock rate. We would like to minimize the
width of the paths and thus, if possible, pipeline transmission on them at
the minor clock rate. This can be done without losing the advantage of the
longer clock period between major components. The sending unit will transmit
its output at the minor clock rate and will also send a copy of its clock
pulse in parallel with the information. Provided all parallel paths are the
same length, this pulse can be used for reading the transmission line. We
can design a single interface that accepts input from such a line and the
clock pulse of the receiving unit and that inputs data to the receiving unit
according to its clocking. This can be accomplished by a simple circular
buffering technique in which registers are written with one timing pulse and
are read by the other pulse. Since both pulses originate from the same
master clock, there will be a constant phase difference between them.
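
The circular buffering technique can be illustrated with a small software
model. This is a sketch under stated assumptions, not the interface logic of
the design: the class name, the buffer depth of 4, and the representation of
the constant phase difference as a whole number of pulses (`lag`) are all
hypothetical.

```python
class CircularInterface:
    """Registers written on the sender's pulse, read on the receiver's pulse."""

    def __init__(self, depth=4):
        self.regs = [None] * depth
        self.write_ptr = 0
        self.read_ptr = 0

    def on_sender_pulse(self, word):
        # The sender's clock copy, travelling with the data, latches the word.
        self.regs[self.write_ptr] = word
        self.write_ptr = (self.write_ptr + 1) % len(self.regs)

    def on_receiver_pulse(self):
        # The receiving unit's own clock reads the oldest unread register.
        word = self.regs[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.regs)
        return word


def transfer(words, lag=2, depth=4):
    """Send `words` through the interface with the reader `lag` pulses behind."""
    if not 0 < lag < depth:
        raise ValueError("lag must be positive and smaller than the depth")
    iface = CircularInterface(depth)
    received = []
    for tick in range(len(words) + lag):
        if tick < len(words):
            iface.on_sender_pulse(words[tick])
        if tick >= lag:
            received.append(iface.on_receiver_pulse())
    return received
```

Because the lag is constant and smaller than the number of registers, the
writer never overtakes the reader, and no handshaking is needed per word.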

There is an additional buffering problem associated with this timing
structure that our interface will handle. Not all of the units can
necessarily process an arbitrary input stream at the maximum possible rate.
There is usually some internal buffering to minimize the effects of
transients, but it is possible for these buffers to become full. Just prior
to this occurring, the receiving unit must notify its interface to stop
transmitting information. Because of the long delays possible between major
units, as much as two major clocks' worth of information may be transmitted
before the interface has a chance to notify the sending unit to halt. The
interface must buffer this amount of information. It would be possible to
provide a single design for such units with varying capacities and use them
at all long delay interfaces.
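
The buffering requirement follows directly from the clock ratio. The sketch
below assumes a path carrying one word per minor clock and a worst-case
notification delay of two major clocks, as stated above; the function names
and the simple tick-counting model are illustrative.

```python
MINOR_PER_MAJOR = 8          # minor clocks in one major clock
WORDS_PER_MINOR = 1          # a path pipelined at one word per minor clock
MAX_NOTIFY_DELAY_MAJOR = 2   # worst-case delay before the sender halts


def in_flight_words(delay_major=MAX_NOTIFY_DELAY_MAJOR):
    """Words the interface must be able to absorb after requesting a halt."""
    return delay_major * MINOR_PER_MAJOR * WORDS_PER_MINOR


def simulate_halt(stop_requested_at, notify_delay_minor):
    """Count words arriving between the stop request and the sender halting."""
    absorbed = 0
    for tick in range(stop_requested_at + notify_delay_minor):
        if tick >= stop_requested_at:
            absorbed += 1    # the sender has not yet seen the halt signal
    return absorbed
```

Under these assumptions the interface needs room for 16 words per path,
whatever the moment at which the halt is requested.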

3.1.3.2  Pipeline  or  Parallel  Units 

The timing structure we have just described is well suited to either
pipeline or parallel execution units or a combination of both types. We
have established that transfers between units will be structured in a
pipelined manner to minimize interconnections. This structure is perfectly
suited to full parallel operation. We need only introduce an 8-word wide
buffer to accumulate the pipelined inputs in preparation for their access
by a fully parallel execution unit. It is assumed that the execution time
of a fully parallel unit would be at least 8 minor clock periods.
The interconnections are suitable for direct input to a pipeline processor.
If there is a set-up time associated with the processor, then changing to a
different operation might require some buffering similar to that for a
parallel unit. On a machine of the nature we are constructing, a pipe with
long set-up time for general-purpose arithmetic would be impractical. On the
other hand, such a pipe might be desirable for some specialized processing
unit designed for a specific function in a specific algorithm. The timing
structure provides substantial but not unlimited flexibility in choosing the
structure of computational hardware.
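
The 8-word wide buffer in front of a fully parallel unit can be modelled as
a simple grouping step. This is a minimal sketch: the function name is
hypothetical, and the handling of a partial final block (here it is simply
not emitted) is an assumption, since the text does not specify it.

```python
VECTOR_WIDTH = 8   # words accumulated before a parallel unit is given a block


def gather_for_parallel_unit(stream, width=VECTOR_WIDTH):
    """Group a one-word-per-minor-clock stream into width-word blocks."""
    block = []
    for word in stream:
        block.append(word)
        if len(block) == width:
            yield tuple(block)   # a full block is ready for parallel access
            block = []
    # A partial trailing block is held back in this sketch (assumption).
```

One block becomes available every 8 minor clocks, which matches the assumed
minimum execution time of a fully parallel unit.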

3.1.3.3  Additional  Advantages  of  the  Interconnection  and  Timing  Structures

The interconnection structure we have described is particularly well
suited to error correction and detection techniques and to performance
monitoring. We regard both of these functions as being particularly
important, given our overall approach. The larger the gate count of a
machine, the lower the MTBF. In addition, the more costly a machine is, the
more expensive down-time is both directly in terms of lost computer time and,
even worse, in terms of the delay in projects dependent on the machine. The
most serious obstacle to effectively using ILLIAC IV is a very small MTBF
combined with something like an hour required to isolate a failing PE,
replace it, and verify that no new errors have been introduced in the
process. Fixing the CU requires several hours to days. To some degree, these
problems are undoubtedly due to the decision to build ILLIAC out of the
fastest logic available instead of logic that has been more highly developed
and is better understood. Larger scale integration is likely to significantly
improve circuit reliability [9]. Nonetheless, in constructing computers with
very large gate counts, reliability problems inevitably increase. Providing
a design which allows for the easy addition of architectural features that
improve reliability is an extremely desirable feature for a paper machine of
the sort we are proposing. It allows decisions to be made after hard
information on reliability has been obtained. It also allows for the
construction of machines with various cost-versus-reliability tradeoffs.
Since many of the applications for large computers involve real-time
applications, there may exist a need for super-reliable versions of such
machines.

Performance monitoring is important for any complex, expensive system.
Complexity inevitably implies that deep analytical understanding becomes
extremely difficult and expensive to obtain or simply impossible. Many
existing computer architectures have reached the point where such
understanding is at least pragmatically impossible to obtain. Our design has
evolved from existing concepts but is significantly different from and more
complex than existing systems. Thus we are in a position where analytical
understanding is impossible and experience with existing machines is
inadequate. Providing detailed critical information on performance in
prototype models would be a mandatory requirement for perfecting the design
concepts we are describing. Providing such information in existing systems
would be extremely important in developing operating systems and compilers.
Because of our modular structure, hardware improvements would also be
possible at this stage. Finally, such monitoring would provide excellent
feedback on what constitutes good programming techniques for this
architecture.

We  will  now  describe  how  improved  reliability  and  performance  monitoring 
are  obtainable  from  our  structure. 

3.1.3.3.1   Error  Detection  and  Correction 

A traditional expensive but simple method of providing error detection
or correction is with replicated hardware. Providing duplicates for all
components allows detection of errors as discrepancies in the outputs. Error
correction is provided by triplicated hardware and majority vote when an
error is encountered. In the case of main memory and other back-up memory
devices, we would propose that single error correction, double error
detection codes be used. This provides protection essentially equivalent to
triple redundancy at a modest cost in additional logic. We propose the more
costly triple redundancy for the remainder of the logic because of its
simplicity and suitability for the structure. In particular, the triple
redundancy can be provided at the major component level. We can enhance the
interface units we have described to include the error detection and
correction function. This could be done without imposing more than a couple
of additional minor clock delays in actual processing. The delays needed to
synchronize the two or three input signals may impose some additional delay
as a function of the physical layout of the machine.

One major advantage of this approach would be the ability to do
real-time, automated error isolation. Upon detecting an error, the operating
system could notify the operator to "Please replace Module X12 in Cabinet 5,
Rack 4 with Part 210Z in Storage Cabinet 3, Shelf C." In double redundancy,
the operating system could, in most instances, lock out the affected unit,
restart any affected program, and continue operation with somewhat reduced
capabilities. In the case of triple redundancy, no error should be
introduced in any running program, and no reduction in capacity would occur.
Clearly, if the MTBF of the individual components is at a reasonable level,
the overall system could approach 100 percent reliability and availability.
In addition, maintenance and repair in almost all cases could be done by
unskilled personnel. Of course, repair of the individual modules would
require different approaches, but this can be done in a leisurely manner if a
reasonable inventory of spares is maintained.

By associating the error correction function with the interface unit,
we can essentially eliminate the problem of who referees the referees.
Figure 4 illustrates this structure. A, B, C, and D refer to different
functional units. The subscripts refer to the three copies of each unit.
There is an interface for each copy of each unit and each connection to a
unit. The interfaces are labeled with the source name followed by the name
of the particular copy of the destination unit. For error isolation
purposes, all interfaces are to be considered as part of the unit they input
to. Thus, if there is an error in AB0, this will affect the outputs of B0
and will be detected by BD0, BD1, and BD2. All three will signal the
operating system that there is an error in B0, which includes the interface
AB0. From that point on until it is replaced, the outputs of B0 will be
ignored.

FIGURE  4   ERROR  CORRECTION  CONNECTIONS
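
The voting performed by each receiving interface can be sketched in a few
lines. This is an illustration of ordinary majority voting with fault
reporting, under the assumption that each interface sees one output word
from every copy of the source unit; the function name and the dictionary
representation are hypothetical.

```python
from collections import Counter


def vote(copies):
    """Majority-vote the outputs of replicated units.

    `copies` maps a copy number to that copy's output word. Returns the
    majority value and the sorted list of copy numbers that disagreed with
    it. Raises ValueError when no strict majority exists, which is the
    detect-only situation of duplicated hardware.
    """
    counts = Counter(copies.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(copies):
        raise ValueError("no majority among copies")
    faulty = sorted(i for i, out in copies.items() if out != value)
    return value, faulty
```

With three copies, a single bad output is outvoted and its copy number is
reported, so the operating system can be told which module to ignore and
replace. With two copies a disagreement can only be detected, not corrected.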

3.1.3.3.2   Hardware  Performance  Monitoring 

The interfaces also provide an obvious source of information for very
detailed performance monitoring. It would be practical to include within
each interface a microcomputer to monitor the communication and selectively
transmit information to a central performance monitoring system. If
facilities were provided for altering the programming of these
microcomputers, this structure would provide an extremely powerful and
flexible system. As we have already mentioned, the somewhat radical and very
complex nature of the structure we are proposing makes such a facility
extremely desirable if not essential. It is our belief that as computers
become more complex, real-time performance monitoring will become an
essential element in the feedback loop that should lead to "better"
computers.

3.2   GENERAL  DISCUSSION  OF  DESIGN  TECHNIQUES

We will somewhat arbitrarily divide this discussion of techniques into
a discussion of pipelining and parallelism analysis, and a discussion of
techniques for reading and updating descriptive tables. The former refers
to a flow analysis used to determine the degree of parallelism and
pipelining required to insure that all components of the machine are able to
keep up with each other. It also refers to the queueing techniques used to
smooth the flow between units. The processing of descriptive tables refers
to the algorithms for maintaining an adequate description of the state of
the machine and programs. Our hardware implementation of operating system
and compiler functions and our queueing techniques require hardware
maintenance of some fairly sophisticated tables.

Before we describe these techniques in detail, we need to say a few
words about our overall approach to pipelining and parallelism. More
specifically, we will discuss what we consider to be the primary obstacle to
effectively utilizing these techniques. This is program nondeterminism. We
have kept the parallelism of individual computation units small enough that
we know most programs can, in theory, effectively use it. To turn this
theoretical possibility into a practical reality and for other reasons which
we have discussed, we will construct an elaborate system for hardware control
of the individual execution units. The operation of this analysis and control
hardware must be overlapped with actual program execution. This implies a
pipeline structure. The more complex the analysis, the longer this pipe must
be. We will do a flow analysis in the process of doing detailed design,
which should insure that the machine will be operating efficiently as long
as instructions keep flowing in at the head of the pipe. Conditional
transfers can break up this flow and have a devastating effect on overall
efficiency. There are several aspects of our approach that will minimize
this problem. The problem cannot be eliminated for all programs.

In Section 3.3 we discuss parallelism from a completely abstract
perspective. In particular, we will discuss the general question of the
structure of algorithms and transformations to map them onto various
parallel computing structures. For now we simply concede that there are
some algorithms poorly suited to the structure we will develop. We believe
this group of algorithms is quite a small percentage of all useful
algorithms. We will now describe how the problems associated with
conditional transfers can be overcome for most algorithms.

We have three complementary approaches to this problem. These are the
use of an if tree analyzer, compilation and execution time analysis of flow
of control, and instruction level multiprogramming. In analyzing FORTRAN
programs for parallelism [5], it was determined that there are "bursts" of
assignment, go to, and if statements. Special hardware has been designed
[3] to process such if nodes in parallel. This results in converting a
sequence of nondeterministic nodes to a single nondeterministic node. Such
an execution unit can be included in the vector portion of our machine.
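
The effect of collapsing a burst of if nodes into one decision can be
sketched in software. This is not the hardware of [3]; it is a minimal
illustration that assumes the conditions in the burst are independent of one
another, so that all of them can be evaluated at once and a single multi-way
choice made afterward. The function name, the state dictionary, and the
label strings are hypothetical.

```python
def resolve_if_chain(state, tests, fallthrough):
    """Collapse a chain of if statements into one multi-way branch.

    `tests` is a list of (predicate, target) pairs in program order. Every
    predicate is evaluated first (in hardware these evaluations could occur
    simultaneously); the target of the first true one is then selected.
    """
    outcomes = [predicate(state) for predicate, _ in tests]
    for taken, (_, target) in zip(outcomes, tests):
        if taken:
            return target
    return fallthrough


# Hypothetical burst of three if statements branching to labelled targets.
TESTS = [
    (lambda s: s["x"] < 0, "L10"),
    (lambda s: s["x"] == 0, "L20"),
    (lambda s: s["y"] > 5, "L30"),
]
```

Three sequential branch decisions become one, so the pipe is disturbed once
rather than three times.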

In referring to analysis of flow of control, we have in mind
differentiating deterministic program loops from true program nondeterminism.
The critical parameter is the time between when it is known which alternative
of a branch must be taken and the moment when the branch occurs. Counting
loops with a limit computed outside of them can be made completely
deterministic. The compiler can recognize this situation, and the MID can be
constructed to use this knowledge. In general the compiler can attempt to
move any branch dependent computation as far ahead of the branch as possible.
Back substitution and the introduction of redundant computations could be
used in some instances. We will discuss these alternatives in more detail
in Chapter 5.

Instruction level multiprogramming refers to the fact that we have up
to four MIDs simultaneously processing different programs or different
parallel paths of the same program. These MIDs simply load various queues
which drive memory and other resources. If one of the MIDs is held up, it
simply stops feeding the queues. Data rates are such that the other MIDs can
take up the slack. In fact, one MID is capable of fully utilizing the
computing resources. Further, no instruction ever gets past the MID unless it
can proceed to completion. In particular, all operands must be in memory.
Thus, in multiprogramming mode, if the nondeterministic branches are
relatively sparse, utilization can approach 100 percent.

There are algorithms with a very high level of program nondeterminism,
and they would not be well suited to an architecture of the sort we are
proposing. However, we do believe that most programs will be able to run
efficiently on our machine. As one measure of the level of nondeterminism
in programs, we can examine some of the parameters measured in the analysis
of FORTRAN programs [5]. Attempting to speed up a highly nondeterministic
program with parallel execution inevitably results in very poor efficiency
compared to executing the same program on a serial machine. Yet, it was
possible to maintain an efficiency of 0.3 to 0.4 over a broad class of
programs while using, in almost all cases, more than 16 parallel units and,
in the majority of cases, more than 30.

3.2.1   Pipeline  and  Parallel  Design  Techniques

The problem of designing the IUD was to construct in logic an
algorithm for carrying on a fairly complex set of scheduling tasks. We will
outline here our general approach to the problem. The actual performing
of computation is controlled by FIRFO queue driven units which accept queue
entries furnished by the IUD. We will describe in general terms the
operation of these queue driven units.

3.2.1.1   IUD  Design  Analysis 

The  steps  taken  in  designing  the  IUD  were  as  follows: 

1.  Compute  the  instruction  emergence  rate  required  to  keep  the 
rest  of  the  machine  active. 

2.  List  the  functions  the  IUD  was  required  to  perform.  Estimate 
how  long  each  of  these  would  take  and  list  other  functions  they 
may  be  dependent  on.   (Table  24) 

3.  Make  an  IUD  pipeline  diagram  giving  time  versus  function(s) 
performed.   (Table  25) 

4.  Do detailed logical design of each of the functional units.
If any of the units cannot be designed to meet the estimates of
step 2, modify the pipe diagram of step 3 appropriately.

In performing the internal design of the various units, a similar
approach was applied in a less systematic way. For the most part, this
process worked fairly well. As with any moderately complex subroutine in a
computer program, we are quite certain that any of our individual designs
could be improved upon by additional work. The final structure of the IUD
pipe did turn out to be significantly different from the initial diagram
we constructed.

The IUD processes instructions for all the various computation and
memory units. In the process of doing the design, it was noted that the
scalar instructions could be processed for the most part independently
of the other instructions. There was a definite advantage to doing this
processing independently after the instructions emerged from the main IUD
pipe. The instructions could be processed at the maximum rate for scalar
instructions as opposed to the maximum rate for all types of instructions
at a considerable saving in hardware. In Section 4.6 we describe both
our original structure and how it was modified in the course of design.

The one unit in the IUD pipe that did not quite meet our time
constraint of 8 levels of logic was the unit that allocated functionally
equivalent VEUs. This unit is designed in Section 4.6.3.2.7 and is probably
the kludgiest of any of the units we designed. We choose to discuss that
design here, not out of masochism, but because we learned the most in
constructing that unit.

In the process of designing a portion of this unit, we developed a
systematic notation for generalizing the techniques used to construct a
carry save adder. Of course the technique is not likely to be applicable
to all problems of speeding up logical circuits. Further, the systematic
portion of the procedure is the notation. The notation must be
applied in an intelligent and sometimes imaginative way to provide a
high-speed logical design for a specific functional unit. Nonetheless, the
notation described in Section 4.6.3.2.7 and used in the appendix does seem
likely to be a powerful tool for designing fast and complex functional
units.
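
For readers unfamiliar with the starting point of that generalization, the
standard carry save adder can be sketched as follows. This shows only the
conventional technique, operating on non-negative integers as bit vectors;
it does not reproduce the notation of Section 4.6.3.2.7, and the function
names are illustrative.

```python
def carry_save_add(a, b, c):
    """Reduce three operands to two without propagating carries.

    Returns (sum_bits, carry_bits) such that a + b + c equals
    sum_bits + carry_bits. Each bit position is handled independently,
    so the delay does not grow with the word length.
    """
    sum_bits = a ^ b ^ c
    carry_bits = ((a & b) | (a & c) | (b & c)) << 1
    return sum_bits, carry_bits


def add_many(operands):
    """Sum many operands with carry save steps and one final ordinary add."""
    s, c = 0, 0
    for x in operands:
        s, c = carry_save_add(s, c, x)
    return s + c   # the only step in which carries actually propagate
```

The point of the technique is that the slow carry propagation is deferred to
a single final addition, however many operands are combined.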

3.2.1.2   Queuing  Techniques

We have described the basic operation of the FIRFO queues. In this
section we will provide a somewhat more detailed analysis of their operation
and an analysis of their size. A queued instruction is allowed to proceed
when it is the oldest one in the queue and when all required resources are
available. Resources refer to the unit the queue drives and the operands
for each queued instruction. The unit becomes available at a time
determined by the previous instruction. The unit will instruct the queue
control of its becoming available in enough time to allow for the queue
search and any preliminary set-up steps required. The determination of when
operands are available differs with different types of units. We will
briefly describe the operation of the vector and scalar execution unit
queues and the memory queues. These units will be described in detail in
various sections of Chapter 4.

There will be associated with each Vector Execution Unit physical
registers for operands and results. When a vector instruction is processed
by the IUD, it will transmit instructions to switch the operands to the VEU
assigned to the instruction. It will assign specific physical registers
for those operands. The queued instruction within the VEU is ready to
execute when those specified registers are loaded. A physical register for
the result is also allocated within the VEU. Since this allocation is done
by the IUD, no instruction which reaches the VEU queue can be held up for
lack of a place to put the result. A range of 8 to 16 seems a reasonable
size for this queue. This estimate is based on two constraints: a binary
unit would require twice as many operand registers as queue entries, and
probably no more than 8 queue entries could be checked in one major clock.
The first constraint is important because of the cost of the 8-word wide
parallel buffer. Because there is likely to be something like a 6 major
clock delay between when buffers are reserved by the IUD and when the
instruction enters the VEU queue, we would want more than two registers
per queue entry for a unit that only processes binary vector instructions.
The second constraint is important because once it takes longer to search
the entire queue than it does for a vector instruction to execute, it
becomes increasingly likely that in doing a full queue search an earlier
instruction that was not ready to execute when it was tested will become
ready to execute. Thus, long queue searches can defeat the FIRFO philosophy.
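
The queue search for a Vector Execution Unit can be sketched as a scan from
oldest to youngest entry. This is an interpretation, not the design of
Chapter 4: it assumes the search issues the oldest entry whose operand
registers are all loaded, and it caps the search at 8 entries, the number
the text suggests could be checked in one major clock. The class, field, and
function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str             # illustrative label for the queued instruction
    operand_regs: tuple   # physical registers assigned by the IUD


def select_next(queue, loaded_regs, unit_free, max_checked=8):
    """Return the position of the entry to issue, or None.

    The queue is ordered oldest first. An entry is ready once every
    operand register assigned to it has been loaded.
    """
    if not unit_free:
        return None
    for pos, entry in enumerate(queue[:max_checked]):
        if all(reg in loaded_regs for reg in entry.operand_regs):
            return pos
    return None
```

Because the scan stops at the first ready entry, an older instruction always
wins when several are ready, which is the property that a search longer than
one instruction time would put at risk.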

Scalar Execution Units only have internal buffers for the current and
next operands and the current and next result. All scalar results from
scalar instructions are assigned a time index. A scalar operand may or may
not have a time index. If it has been recently computed, it will. There
are many fewer time indexes than physical scalar buffer locations, and the
time indexes are constantly being recycled. Thus, a scalar operand may
refer to a physical location whose time index has been reused and is no
longer associated with it. The mechanism for assigning these indexes is
described in detail in Section 4.3. Any scalar operand without a time index
is available. A scalar operand with a time index may be available in either
of two places. For each of these buffers there is a set of bits, one for
each time index, that indicates if the corresponding operand is in the
respective buffer. The first of these is simply the main scalar buffer.
Two queued scalar instructions may produce results to the same physical
scalar buffer location. If the logically later of these is ready to
proceed before the earlier, it will store its result in a special result
buffer. The time indexes will assure that the correct value is ultimately
stored in the scalar buffer and that intermediate instructions access the
correct values. Because their operand buffers are not separate from each
other, it makes sense to have a single Scalar Execution Unit queue drive all
equivalent SEUs. This fact combined with our earlier observation about
queue size versus queue search time means that we would probably want
larger scalar queues than vector queues. A size of 16 would probably be
reasonable.

Each  memory  page  of  8  x  1  K  words  will  have  its  own  queue.  In  the 
case  of  all  instructions,  the  queue  control  must  insure  that  all  indexes 
and  modes  are  available,  i.e.,  have  been  transmitted  to  appropriate  buf- 
fers within  the  page.  Further,  it  must  insure  that  the  instruction  can 
proceed  without  producing  a  logical  error.  Various  schemes  could  be  used 
to  determine  this.  The  simplest  would  require  that  all  instructions  pro- 
ceed in  exactly  the  sequence  they  entered  the  queue.  One  could  allow  non- 
indexed  instructions  to  be  executed  out  of  sequence  if  their  addresses 
insured  that  no  conflicts  would  result.  In  the  most  general  case,  one 
could  do  arithmetic  on  all  available  and  relevant  indexes  and  modes  to  see 
if  any  instruction  could  proceed.  As  soon  as  any  instruction  with  an  un- 
available index  is  encountered,  the  queue  search  must  stop.  A  queue  size 
of  8  -  16  would  probably  be  reasonable  for  a  memory  page.  Experimentation 
might  reveal  that  queue  sizes  smaller  than  the  ones  we  have  suggested  for 
all  units  might  be  practical. 

In  all  the  above  cases  involving  local  buffers,  the  various  units  must 
notify  the  IUD  as  buffer  locations  become  available.  There  is  an  additional
problem  associated  with  the  vector  result  buffers.  Values  from  these 
buffers  may  be  accessed  as  operands  for  other  vector  instructions.  Thus,
these  locations  may  be  used  until  the  corresponding  logical  location  is 
reused  in  the  OFFL  instruction  stream.  We  must,  however,  insure  there  is 
space  in  these  buffers  for  new  instructions.  Thus,  the  local  control  must 
initiate  a  transfer  of  some  of  these  operands  to  the  main  Vector  Buffer 
when  it  becomes  too  full. 

3.2.1.3   Resolving  Buffer  Access  Conflicts 

Our  local  control  and  queue  driven  structure  can  often  result  in  buffer
access  conflicts.  Two  methods  for  handling  this  are  to  allow  multiple
simultaneous  accesses  to  the  same  memory  and  to  provide  hardware  for  conflict
resolution.  The  first  method  is  employed  in  some  of  the  IUD  tables
because  of  the  necessity  for  very  high  access  rates.  This  involves  providing
multiple  addressing  logic  and  a  larger  fanout  from  each  bit  of
storage.  It  makes  the  memory  considerably  more  expensive  and  is  thus  only
used  when  required  by  the  data  rates.  For  the  other  more  common  case,  we
have  developed  a  very  simple,  fast  and  cheap  circuit  for  conflict  resolution.
It  is  described  in  Section  4.4.2.4.

3.2.2   Tables 

In  this  section  we  provide  a  general  description  of  the  hardware 
maintained  tables  and  the  algorithms  for  updating  and  accessing  them. 
Vector  tables  are  provided  to  map  physical  buffer  addresses  to  logical 
buffer  addresses  and  to  maintain  use  counts  for  active  buffer  locations. 
A  use  count  is  the  number  of  accesses  for  a  particular  register  that  have 
been  processed  by  the  IUD  but  have  not  yet  occurred.  Its  use  count  must  be 
zero  before  a  physical  buffer  location  can  be  reused.  The  original  assign- 
ment of  physical  to  logical  address  is  made  by  the  IUD  when  a  store  to  a 
logical  address  occurs.  This  assignment  is  made  from  a  known  list  of  free
physical  registers.  A  table  whose  addresses  correspond  to  logical  addresses 
contains  the  physical  address  for  each  assigned  logical  address.  This  table 
is  used  to  determine  the  physical  location  of  an  operand.  Every  access  to 
a  location  in  this  table  by  the  IUD  results  in  an  increase  in  the  use  count 
also  stored  in  that  location.  Periodically  information  is  obtained  from 
the  various  execution  units  giving  a  list  of  physical  addresses  which  have 
been  accessed.  This  list  is  used  to  access  the  table  in  an  associative 
fashion  and  decrement  the  use  counts.  When  a  use  count  goes  to  zero  after
the  corresponding  logical  address  has  been  reused,  then  the  physical  address
can  be  reused.  Both  conditions  are  necessary,  the  former  to  insure
that  there  is  no  access  to  the  register  that  has  yet  to  be  made  and  the
latter  to  insure  that  no  instruction  not  yet  processed  by  the  IUD  will
require  the  value.
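
The bookkeeping described above can be sketched in software. What follows is a minimal illustration and not the hardware design: the class and method names are hypothetical, the use counts are keyed by physical address rather than stored in the logical table, and we assume that the pending write of a newly assigned register counts as one outstanding access.

```python
from typing import Dict, Iterable, List, Set


class VectorMapTable:
    """Hypothetical model of the logical-to-physical vector buffer table."""

    def __init__(self, n_physical: int) -> None:
        self.free: List[int] = list(range(n_physical))  # known free physical registers
        self.physical_of: Dict[int, int] = {}           # logical -> current physical address
        self.use_count: Dict[int, int] = {}             # physical -> accesses not yet made
        self.superseded: Set[int] = set()               # physicals whose logical address was reused

    def store(self, logical: int) -> int:
        """The IUD processes a store: bind the logical address to a free register."""
        if not self.free:
            raise RuntimeError("no free physical register: the IUD would wait")
        old = self.physical_of.get(logical)
        new = self.free.pop(0)
        self.physical_of[logical] = new
        self.use_count[new] = 1  # assumption: the pending write is itself an access
        if old is not None:
            self.superseded.add(old)
            self._release_if_unused(old)
        return new

    def read(self, logical: int) -> int:
        """The IUD looks up an operand: every lookup raises the use count."""
        physical = self.physical_of[logical]
        self.use_count[physical] += 1
        return physical

    def report_accessed(self, physicals: Iterable[int]) -> None:
        """Execution units report completed accesses: lower the use counts."""
        for physical in physicals:
            self.use_count[physical] -= 1
            self._release_if_unused(physical)

    def _release_if_unused(self, physical: int) -> None:
        # Both conditions are required: logical address reused AND use count zero.
        if physical in self.superseded and self.use_count[physical] == 0:
            self.superseded.remove(physical)
            del self.use_count[physical]
            self.free.append(physical)
```

In this sketch a register returns to the free list only inside `_release_if_unused`, which is called both when a logical address is reused and when an access is reported, so either event may be the one that completes the pair of conditions.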

A  similar  structure  is  provided  to  keep  track  of  scalars.  In  parti- 
cular, use  counts  must  be  maintained  to  insure  that  no  physical  scalar 
buffer  address  is  overwritten  while  there  is  a  queued  instruction  requir- 
ing that  value. 

3.2.3   Deadlock 

In  designing  a  machine  with  this  structure,  one  must  be  certain  that 
no  deadlocks  can  occur.  By  consistently  following  two  basic  design  con- 
straints, we  have  assured  this.  First,  no  instruction  gets  past  the  IUD 
unless  all  resources  required  for  its  execution  are  immediately  available. 
In  particular,  instructions  which  would  cause  a  memory  page  fault  do  not 
get  past  the  MID.  All  instructions  which  require  the  allocation  of  tempo- 
rary registers  have  that  allocation  made  within  the  IUD  from  a  set  of  known 
registers  that  are  physically  available.  The  second  constraint  is  that
whenever  a  required  resource  is  not  available  to  the  IUD,  it  ceases  pro- 
cessing all  instructions  until  the  resource  becomes  available.  For  example, 
if  an  instruction  requires  space  in  a  queue  that  is  full,  later  instruc- 
tions which  may  not  directly  use  that  queue  will  also  be  held  up.  Thus, 
no  instruction  will  enter  and  possibly  block  a  queue  because  it  is  depend- 
ent on  the  results  of  an  instruction  that  is  not  yet  in  a  queue.  In  con- 
junction with  the  first  constraint,  this  assures  that  once  an  instruction 
enters  a  queue,  all  its  operands  will  eventually  become  available  and  it  can 
proceed.  Thus,  certain  badly  balanced  instruction  sequences  could  degrade 
the  performance  of  this  structure,  but  no  instruction  sequence  could  com- 
pletely block  it. 
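
The second constraint can be illustrated with a small sketch. It is a simplification under stated assumptions: the function name `dispatch`, the representation of an instruction as a (unit, name) pair, and a single shared queue capacity are all hypothetical, and queue space is the only resource modeled.

```python
from collections import deque
from typing import Deque, Dict, Tuple


def dispatch(stream: Deque[Tuple[str, str]],
             queues: Dict[str, Deque[str]],
             capacity: int) -> int:
    """Move instructions, strictly in order, from the stream into unit queues.

    Dispatch stops at the first instruction whose queue is full, so later
    instructions are held up even if their own queues have space.
    Returns the number of instructions dispatched.
    """
    dispatched = 0
    while stream:
        unit, name = stream[0]
        if len(queues[unit]) >= capacity:
            break  # required resource unavailable: cease processing everything
        queues[unit].append(name)
        stream.popleft()
        dispatched += 1
    return dispatched
```

With a capacity of one and the stream add, add, mul, only the first add is dispatched; the mul waits behind the second add even though the mul queue is empty. Performance suffers, but no queued instruction can be waiting on one that has been overtaken.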

3.3   PARALLELISM  -  AN  ABSTRACT  DISCUSSION

This  section  is  a  general  philosophical  discussion  on  the  nature  and 
scope  of  parallel  computing  structures  and  is  not  immediately  related  to 
the  remainder  of  this  thesis.  We  will  suggest  a  possible  basis  for  relating 
the  problem  of  understanding  computing  structures  to  the  general  problem 
of  understanding  mathematical  structures.  We  will  not  be  presenting  estab- 
lished results,  but  rather  pointing  out  similarities  and  suggesting  possible 
approaches. 

It  is  a  great  luxury  in  conventional  computer  architecture  that  all 
words  of  main  memory  are  equally  accessible  by  the  processing  portion  of 
the  machine.  Parallelism  replaces  this  "amorphous"  topology  of  data  inter- 
action with  a  specific  structure.  In  a  totally  abstract  sense  the  problem 
of  parallel  computing  design  is  that  of  determining  classes  of  data  inter- 
action topologies  that  correspond  to  significant  real  problems  and  that  can 
be  mapped  in  an  efficient  way  to  a  single  computer  topology.  Mathematics 
is  the  study  of  arbitrary  abstract  structures.  Some  of  these  are  obviously 
and  directly  related  to  problems  of  computer  architecture.  We  will  describe 
some  of  these  direct  relationships  for  a  very  substantial  portion  of  all
mathematics  and  suggest  possible  approaches  to  investigate  computer  archi- 
tecture utilizing  this  body  of  mathematics.  We  will  then  briefly  explain 
why  we  believe  the  study  of  structures  relevant  to  computation  includes 
all  of  mathematics.  Finally,  we  will  discuss  some  of  the  implications  of 
this  point  of  view  for  a  theory  of  mathematical  truth. 

Two  fundamentally  different  measures  of  the  strength  of  a  mathematical 
system  are  provability  and  definability.  The  former  refers  to  what  questions 
can  be  decided  by  the  system  and  the  latter  refers  to  what  questions  can  be 
stated  within  the  system.  As  we  have  suggested  elsewhere  [2],  one  can
directly  relate  mathematics  through  the  hyperarithmetical  sets  to  compu- 
tation related  structures.  We  can  begin  with  some  language  adequate  to 
describe  all  finite  state  machines.  We  Godel  number  all  statements  within
this  language.  We  have  a  separate  Godel  numbering  for  all  Turing  machines
with  blank  input  tapes.  We  code  the  output  of  these  so  that  they  represent
either  the  Godel  number  of  another  Turing  machine  or  the  Godel  number
of  a  statement  in  our  language  describing  finite  state  machines.  We  now
assign  truth  values  to  each  of  the  Turing  machines  as  follows:  the  truth 
value  of  a  Turing  machine  is  true  if  it  has  an  unbounded  number  of  outputs 
and  the  truth  value  for  each  member  in  some  unbounded  subset  of  these  is 
true;  the  truth  value  for  any  output  corresponding  to  a  finite  state  machine 
statement  is  true  if  the  statement  is  true.  This  structure  is  completely 
adequate  to  define  all  hyperarithmetical  statements.  This  encompasses  most 
mathematical  questions  and  includes  a  broad  area  that  Intuitionist  mathe- 
maticians consider  to  be  meaningless. 

Central  to  this  level  of  mathematical  definability  are  the  two  related 
concepts  of  constructive  ordinals  in  mathematics  and  non-deterministic 
Turing  machines  in  computer  science.  The  proof  that  every  constructive
ordinal  has  a  recursive  notation  defined  in  a  particularly  technical  way  [7] 
can  be  interpreted  as  demonstrating  that  there  is  a  non-deterministic  Turing 
machine  that  recursively  describes  completely  the  structure  of  any  con- 
structive ordinal.  Mathematical  questions  about  hyperarithmetical  sets 
are  those  which  result  from  "iterating"  up  to  some  constructive  ordinal  the
question  of  whether  there  is  an  infinite  subset,  all  of  whose  members  are
true,  in  a  recursively  enumerable  collection  of  statements  about  finite
state  machines.

Constructive  or  recursive  ordinals  can  also  be  used  as  a  measure  of  the
power  of  a  mathematical  system  in  terms  of  provability.  Loosely  speaking, 
the  larger  the  recursive  ordinal  that  can  be  proven  to  be  a  recursive 
ordinal  in  a  system,  the  more  powerful  it  is  in  terms  of  provability.  There 
are  many  mathematical  languages  rich  enough  to  define  all  recursive  ordinals 
but  no  mathematical  theory  rich  enough  to  prove  that  some  definition  in  the 
language  does  define  a  recursive  ordinal  for  each  recursive  ordinal. 

The  concept  of  recursive  ordinal  can  be  thought  of  as  a  sort  of  measure 
or  classification  of  level  of  complexity  for  an  initial  segment  of  mathema- 
tical structures.  We  suggest  that  this  classification  of  structures  might 
be  a  good  starting  point  in  a  search  for  classifying  various  topologies  of 
data  interaction.  As  an  example,  the  initial  recursive  ordinals  correspond 
in  a  fairly  direct  way  to  the  elementary  mathematical  operations  of  addition, 
multiplication,  and  exponentiation.  These  each  have  different  and  increasingly
complex  topologies  of  bit  interactions.  Different  techniques  of  logical
design  are  required  in  providing  time  versus  gate  count  tradeoffs  in
implementing  them.  The  concept  of  recursive  ordinals  provides  a  detailed 
and  direct  method  of  extending  this  hierarchy  to  more  complex  structures. 

Further,  it  is  my  belief  that  the  concept  of  recursive  ordinals  is 
directly  connected  to  the  computer  science  concept  of  iteration.  This  relationship
tends  to  be  obscured  by  a  modern  set  theory  treatment  of  ordinals.
Modern  set  theory  originated  from  an  attempt  to  avoid  the  paradoxes  dis- 
covered by  Bertrand  Russell  in  earlier  versions  of  set  theory.  It  seems  to 
do  so  in  an  extremely  elegant  and  powerful  way.  However,  returning  to  the 
intuition  that  led  Russell  to  discover  the  paradoxes  and  the  resulting  less
elegant  and  less  powerful  theory  of  types  that  he  proposed  as  a  solution  will
shed  considerable  light  on  the  relationship  between  set  theory  and  a
computer  related  theory  of  iteration.  The  paradoxes  arose  from  sets  with 
self  referencing  definitions  constructed  in  such  a  manner  that  if  some 
element  was  a  member  of  the  set  then  one  could  show  it  was  not  a  member  of 
the  set.  Russell's  solution  was  to  provide  a  sort  of  index  associated 
with  all  statements  used  in  defining  sets.  This  index  provided  a  limit 
over  the  type  of  set  used  in  the  definition.  The  set  being  defined  would 
have  a  higher  index  or  type.  Ordinal  numbers  including  the  recursive 
ordinals  implicitly  form  a  similar  indexing  scheme  for  set  theory.  We  can 
consider  the  problem  of  iteration  as  that  of  applying  various  algorithms  to 
each  other.  The  problem  of  possible  contradictions  is  replaced  by  the 
problem  of  whether  the  resulting  algorithm  computes  a  value  or  simply  loops 
forever.  It  is  possible  to  consider  iterations  on  a  hierarchy  of  functions. 
For  example,  we  can  start  with  algorithms  which  compute  integers  from  integers.
We  can  then  consider  functions  which,  given  a  function  of  this  first
type  and  an  integer,  compute  an  integer.  Given  any  type,  we  can  consider
a  function  of  all  lower  types.  Given  an  effective  procedure  for  listing  an 
infinite  number  of  types,  we  can  consider  a  function  of  a  Turing  Machine 
which  enumerates  an  infinite  sequence  of  such  types.  Using  such  types, 
we  can  construct  more  powerful  techniques  of  iteration.  We  can  also  con- 
struct larger  recursive  ordinals.  Finally,  the  topology  of  the  interaction 
of  the  original  operands  becomes  more  complex  and  more  general  as  we  go 
to  higher  types. 
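
The lowest levels of such a hierarchy can be made concrete. The sketch below is only an illustration of iteration over function types and makes no claim about the precise correspondence with recursive ordinals. The names `IntFn` and `iterate` are hypothetical, and `iterate` takes a starting value in addition to the function and the count.

```python
from typing import Callable

IntFn = Callable[[int], int]  # lowest type: algorithms from integers to integers


def iterate(f: IntFn, n: int, start: int) -> int:
    """A function of the next type: applies the IntFn f to start, n times."""
    value = start
    for _ in range(n):
        value = f(value)
    return value


def successor(x: int) -> int:
    return x + 1


def add(a: int, b: int) -> int:
    # Addition is iterated succession.
    return iterate(successor, b, a)


def multiply(a: int, b: int) -> int:
    # Multiplication is iterated addition.
    return iterate(lambda x: add(x, a), b, 0)


def power(a: int, b: int) -> int:
    # Exponentiation is iterated multiplication.
    return iterate(lambda x: multiply(x, a), b, 1)
```

Each operation is obtained by iterating the one below it, and the pattern by which the original operands interact grows more involved at every step.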

We  are  not  suggesting  that  any  of  these  approaches  are  uniquely  correct, 
but  rather  pointing  out  similarities  and  suggesting  that  each  field  and 
each  approach  may  benefit  from  insights  of  the  others.  We  would  now  like 
to  outline  why  we  believe  reasoning  about  physically  implementable  processes
is  relevant  to  the  outer  reaches  of  mathematical  research.  We  consider 
both  the  problems  of  definability  and  provability.  Problems  associated 
with  the  set  of  all  real  numbers  provide  the  first  obstacle  to  providing 
constructive  interpretations  for  all  of  mathematics.  Cantor's  proof  that 
there  cannot  exist  a  one-to-one  map  from  the  integers  to  the  reals  makes 
it  impossible  to  provide  any  constructive  method  of  naming  all  the  reals. 
Cantor  did  not  prove  that  there  were  more  reals  than  integers  since  the 
existential  status  of  reals  is  in  question.  A  possible  interpretation  of 
reals  is  that  they  represent  properties  of  Turing  Machines.  One  can  con- 
sider that  the  "meaningful  properties"  of  Turing  machines  that  one  could 
invent  might  be  limitless.  By  meaningful  property  we  mean  a  property  that 
is  either  true  or  false  for  any  given  Turing  Machine.  Thus,  each  such  property
under  a  particular  Godel  numbering  of  Turing  Machines  defines  a  real.
We  can  reflect  the  open-ended  nature  of  the  situation  by  employing  a  language
for  describing  properties  in  which  an  infinite  sequence  of  words  is
always  left  undefined.  This  seems  to  me  to  be  a  particularly  desirable
approach  since  it  more  closely  reflects  the  reality  of  the  situation.  We 
know  from  the  Lowenheim-Skolem  theorem  that  any  consistent  mathematical  theory
with  recursively  enumerable  axioms  has  a  countable  model.  This  approach  by
itself  would  not  be  adequate  to  construct  a  constructive  version  of  set  theory.
However,  examining  the  actual  combinatorial  power  of  the  axioms  of  set
theory  and  seeing  if  similar  constructive  interpretations  are  possible
seems  to  me  likely  to  be  successful.

We  now  consider  the  problems  associated  with  providing  constructive 
interpretations  for  set  theory  in  the  domain  of  provability.  In  doing  so, 
we  will  confront  what  is  probably  the  major  philosophical  problem  with  the 
approach  we  are  suggesting.  Mathematics  is  the  one  area  of  human  endeavor
that  is  generally  considered  to  have  a  claim  to  absolute  truths.  Godel's 
Incompleteness  theorem  showed  that  there  exist  fundamental  problems  with 
allowing  mathematics  to  grow  and  at  the  same  time  retain  the  property  of 
possessing  absolute  truths.  One  school  of  mathematics  has  jettisoned  all 
but  totally  constructive  proofs  as  a  means  of  insuring  the  absoluteness  of 
mathematical  truth.  The  Intuitionists  do  not  even  accept  the  statement  that 
any  Turing  Machine  must  either  halt  or  continue  indefinitely.  On  the  op- 
posite end  of  the  spectrum  we  have  what  might  be  considered  the  mystical 
school  of  mathematics.  This  is  the  belief  that  intuition  about  infinite 
sets  allows  mathematicians  to  transcend  the  limits  of  Godel's  Incomplete- 
ness theorem  when  dealing  with  constructive  processes.  As  far  as  I  am 
aware,  no  one  has  seriously  considered  the  possibility  that  mathematics 
should  give  up  its  claim  to  absolute  truth  outside  of  a  narrow  domain  and 
become  a  speculative  and  experimental  science.  Our  suggestion  for  handling 
the  concept  of  real  numbers  is  made  in  this  spirit. 

Godel's  Incompleteness  theorem  established  that  no  mathematical  theory
in  which  a  Universal  Turing  Machine  was  embeddable  and  in  which  the  halting
problem  could  be  defined  could  decide  within  itself  if  it  were  consistent.
This  established  severe  limits  for  any  formal  mathematical  system
with  respect  to  its  power  of  provability.  For  any  such  "true"  system  one 
can  adjoin  the  statement  that  the  system  is  consistent  and  obtain  a  more 
powerful  system.  In  fact,  one  can  regard  the  high  power  that  set  theory 
has  in  a  provability  sense  as  deriving  from  the  powerful  methods  available 
within  it  for  taking  a  powerful  kernel  system  and  iterating  the  statement
that  the  system  is  consistent.  This  is  accomplished  via  the  strong  axioms 
of  infinity  that  allow  one  to  construct  models  of  increasingly  more  powerful 
subsystems.  If  one  can  construct  a  model  for  a  system,  one  has  a  proof
that  it  is  consistent.  An  alternative  approach  would  be  to  directly  study
and  attempt  to  enhance  the  combinatorial  power  of  this  "iterative"  process.
But  to  attack  the  problem  from  that  direction  would  require  giving  up  the 
notion  that  the  results  were  absolute  truth. 

This  non-absolutist  approach  to  mathematical  truth  has  a  philosophi- 
cal appeal.  Perhaps  the  severest  problem  associated  with  the  accomplish- 
ments of  Western  mathematics,  science  and  technology  is  recognizing  the 
limits  of  these  endeavors.  It  is  fitting  that  the  queen  of  the  sciences 
be  the  first  to  establish  precise  limits  for  its  power  and  scope.  It  is 
essential  that  we  know  what  we  do  not  know,  otherwise  we  know  nothing. 
That  is  why  mathematics  is  so  concerned  with  avoiding  contradiction. 

4   COMPUTATION  UNIT  -  DETAILED  LOGICAL  DESIGN

In  this  section  we  provide  a  detailed  logical  design  for  the  computa- 
tion unit  of  Figure  2.  We  will  briefly  describe  its  overall  physical 
structure.  We  will  then  describe  its  overall  functional  structure.  We 
will  then  proceed  to  a  detailed  functional  and  logical  design  of  sufficient 
detail  to  provide  realistic  gate  counts.  In  various  subsections  we  will 
provide  tables  giving  approximate  gate  counts  for  individual  units.  In  a 
concluding  section  we  will  provide  a  summary  gate  count  for  the  entire 
computation  unit.  In  this  section  we  will  group  buffers  by  their  access 
times  and  compute  their  gate  counts  separately.  This  scheme  is  intended 
to  give  a  very  rough  notion  of  the  logical  complexity  and  cost  of  this 
design. 

4.1   OVERALL  STRUCTURE

Figure  2  can  be  partitioned  into  four  major  units.  We  will  refer  to 
these  as  the  scalar  portion,  the  vector  portion,  memory,  and  the  Instruction 
Unit  Dispatcher.  The  scalar  and  vector  portions  are  symmetric  in  the  sense 
that  they  both  consist  of  up  to  six  execution  units,  a  buffer,  and  a  switch. 
The  execution  units  are  the  portions  of  the  machine  that  do  all  actual  com- 
putation. The  switches  operate  under  hardware  control  and  are  responsible 
for  transferring  data  between  buffers,  memory,  and  execution  units.  The 
vector  buffer  is  a  more  or  less  conventional  high-speed  buffer  for  the  main 
vector  memory.  There  is  also  vector  buffer  space  within  the  VEUs.  The 
Scalar  Buffer  is  the  primary  memory  for  scalars.  It  can  be  loaded  via  the 
Vector  Switch  from  main  memory  for  initialization.  There  are  additional 
buffers  associated  with  the  scalar  portion.  They  exist  to  enhance  the 
throughput  of  the  scalar  portion  and  will  be  described  in  detail.  The  Instruction
Unit  Dispatcher  is  the  most  complex  and  unconventional  of  the
major  units.  It  has  responsibility  for  mapping  OFFL  instructions  into  queue
entries  which  drive  the  other  units. 


4.2   FUNCTIONAL  STRUCTURE 

The  functional  structure  can  be  thought  of  as  a  generalization  of  the 
algorithms  used  to  sequence  the  arithmetic  on  the  IBM  360/91  [8]. 
All  of  the  resources  of  the  machine  are  queue  driven.  The  queues  are  not 
strictly  first  in  first  out,  but  rather  first  in  which  is  able  to  begin 
using  the  resource,  first  out.  We  will  refer  to  these  as  FIRFO,  i.e., 
first  in  and  ready,  first  out.  An  instruction  is  ready  when  its  operands 
become  available.  What  constitutes  an  available  operand  will  vary  with  dif- 
ferent types  of  functional  units.  This  structure  allows  the  sequence  of 
instructions  to  be  permuted  in  any  way  which  enhances  resource  utilization 
without  altering  the  logical  structure  of  the  original  program.  It  is  the 
responsibility  of  the  IUD  to  insure  the  logical  integrity  of  the  original 
program.  Most  of  the  complexity  of  the  IUD  is  a  result  of  this  function. 
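
The FIRFO discipline can be sketched as follows. This is an illustration only: the class name `FirfoQueue` and the `is_ready` predicate are hypothetical, and the sketch deliberately omits the dependency checks that the IUD performs to preserve the logical structure of the program.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class FirfoQueue(Generic[T]):
    """First in and ready, first out: the oldest ready entry leaves first."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[T] = []  # kept in arrival order, oldest first

    def push(self, item: T) -> bool:
        """Append an entry; report failure if the queue is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append(item)
        return True

    def pop_ready(self, is_ready: Callable[[T], bool]) -> Optional[T]:
        """Search from the oldest entry and remove the first ready one."""
        for position, item in enumerate(self.entries):
            if is_ready(item):
                return self.entries.pop(position)
        return None  # nothing is ready yet
```

The search is linear in the queue length, which is one way to see why long queue searches work against the FIRFO philosophy.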

4.3   SCALAR  PORTION  OF  COMPUTATION  UNIT

The  scalar  portion  of  the  computation  unit  allows  us  to  perform  opera- 
tions on  scalars  without  tying  up  the  vector  execution  units.  In  addition 
it  contains  a  high-speed  memory  with  sufficient  space  for  the  scalars  in 
almost  any  program.  The  units  that  actually  perform  scalar  operations  are 
constructed  in  a  modular  fashion  to  allow  for  the  construction  and  use  of 
specialized  hardware  at  any  time  during  the  operational  life  of  the  machine. 

4.3.1   Overall  Structure  of  the  Scalar  Portion  of  Computation  Unit 

Figure  5  shows  the  structure  of  the  scalar  portion  of  the  execution 
unit.  We  will  briefly  describe  the  functions  of  each  of  the  units  in  the 
figure  and  the  nature  of  their  interconnections.  The  Scalar  Execution 
Units  contain  the  queues,  control,  and  logic  to  sequence  and  perform  the
scalar  operations.  These  units  receive  instructions  from  the  SIDS  through 
the  instruction  switch.  The  execution  units  make  use  of  the  tables  in  the 
Scalar  Buffer  Status  unit  to  determine  admissible  instruction  sequencing. 
The  execution  units  also  provide  information  for  updating  these  status 
tables  as  instructions  are  executed.  Every  major  clock  the  Scalar  Buffer 
Status  unit  and  the  SIDS  exchange  information  to  update  their  respective 
status  tables.  The  functional  structure  of  the  SIDS  relevant  to  sequencing 
instructions  will  be  described  in  Section  4.3.2.1.  Detailed  design  of  the 
entire  SIDS  is  in  Section  4.6.5.  The  Result  Buffer  is  used  to  buffer  re- 
sults that  would  otherwise  overwrite  operands  needed  by  instructions  waiting 
to  execute.  Its  contents  may  be  accessed  as  operands  and  will  eventually  be 
transferred  to  the  Scalar  Buffer  through  the  Scalar  Switch.  The  Vector- 
Scalar  Buffer  is  used  for  transferring  scalars  between  the  VEUs  and  the  SEUs. 
It  is  addressable  as  if  it  were  an  extension  of  the  Scalar  Buffer,  but  it 
has  a  special  status  table  in  the  Scalar  Buffer  Status  unit  that  must  be
updated  with  information  from  both  the  SEUs  and  VEUs.  The  Scalar  Switch 
does  all  operand  transfers  between  the  various  SEUs  and  buffers.  A  special 
status  table  is  maintained  for  operands  about  to  be  transferred,  and  they 
can  be  accessed  by  other  SEUs  without  going  through  the  Scalar  Buffer.  The 
Scalar  Switch  Control  does  what  its  name  implies.  The  Scalar  Buffer  is  the 
primary  memory  for  scalars.  It  can  be  accessed  by  the  Scalar  Switch  and 
can  accept  data  in  blocks  equal  to  the  standard  vector  width  from  the  vector 
switch.  Reverse  transfers  from  the  scalar  buffer  to  the  Vector  Switch  are 
not  allowed.  The  primary  function  of  the  Scalar  Buffer  Control  is  to  referee 
between  the  Scalar  Switch  and  Vector  Switch  in  their  competition  for  the 
Scalar  Buffer.  We  will  provide  details  of  the  function  and  structure  of 
these  units.  We  begin  by  discussing  the  Scalar  Execution  Units. 

4.3.2   Scalar  Execution  Units 

Figure  6  shows  the  overall  structure  of  the  scalar  execution  unit.  The 
Sequencer  provides  overall  control  of  the  unit.  It  reads  instructions  in  the 
Queue,  checks  the  status  of  the  operands  in  the  Scalar  Buffer  Status  tables, 
and  on  this  basis,  determines  the  sequencing  of  the  queued  instructions. 
When  an  instruction  is  to  be  executed,  the  scalar  switch  must  be  provided 
with  requests  to  access  the  operands  from  the  appropriate  buffers.  The  operands
and  results  are  provided  with  buffers  to  allow  a  continuous  flow  of
operands.  A  special  switch  is  provided  to  allow  results  to  be  used  as  operands
without  going  through  the  scalar  switch.  The  computation  hardware
contains  the  logic  to  perform  the  actual  scalar  operations.  Working  registers 
are  included  in  the  figure  to  emphasize  the  buffering  function  of  the  other 
registers.  If  an  interrupt  condition  occurs,  the  MID  will  be  notified  and 
processing  will  continue. 

We  will  discuss  the  algorithm  for  sequencing  instructions.  Then  we
will  provide  detailed  design  for  the  Instruction  Queue  and  for  the  Sequencer. 

4.3.2.1   Scalar  Instruction  Sequencing 

Scalar  operands  and  results  are  not  uniquely  identified  by  a  physical 
memory  address.  The  queues  allow  considerable  flexibility  in  the  sequence  in 
which  instructions  are  actually  executed.  One  price  of  this  flexibility  is 
the  necessity  of  including  special  hardware  to  insure  that: 

1.  No  scalar  memory  location  is  overwritten  when  its  contents  are 
still  needed  for  some  queued  instruction. 

2.  No  operand  is  fetched  before  the  instruction  that  computes  that 
operand  has  completed. 

To  allow  for  this,  we  will  think  of  scalar  addresses  as  also  including  a 
time  index  to  uniquely  identify  logical  values.  In  addition,  whenever  an 
instruction  enters  the  scalar  queues,  a  use  count  for  all  operands  will  be 
provided.  No  store-to-scalar  instruction  will  be  allowed  to  execute  if  the
physical  address  of  its  result  has  a  non-zero  use  count  for  some  time  index 
earlier  than  the  time  index  of  the  instruction  in  question. 
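
The rule governing stores can be written as a simple predicate. The sketch is an assumption-laden simplification: the function name `store_may_execute` and the dictionary keyed by (physical address, time index) are hypothetical, and time indexes are compared as plain integers, which ignores the fact that the real indexes are recycled.

```python
from typing import Dict, Tuple


def store_may_execute(physical: int,
                      time_index: int,
                      use_counts: Dict[Tuple[int, int], int]) -> bool:
    """Return True if a store to `physical` tagged `time_index` may proceed.

    The store is blocked while any earlier time index for the same
    physical address still has a non-zero use count.
    """
    return not any(
        count > 0
        for (address, index), count in use_counts.items()
        if address == physical and index < time_index  # assumption: no index wraparound
    )
```

A store is therefore held only by outstanding reads of older values at its own physical address; counts on other addresses, or on later time indexes, do not delay it.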

We  wish  to  minimize  the  complexity  and  cost  of  the  logic  that  does  this 
bookkeeping.  As  discussed  in  Section  3.2.1.2,  there  will  be  up  to 
six  scalar  queues,  each  with  a  capacity  of  8  to  16.  In  addition,  it  is 
reasonable  to  assume  that  instructions  will  be  executed  in  a  reasonably  uni- 
form manner.  Whenever  this  is  not  the  case,  the  unit  is  likely  to  be  blocked 
anyway  due  to  whatever  is  causing  the  long  wait  on  some  instructions.  Thus 
between  128  and  256  is  likely  to  be  an  adequate  range  for  active  indices. 
In  the  case  where  this  is  not  adequate,  we  must  halt  processing  of  instruc- 
tions by  the  IUD  until  older  instructions  have  completed.  We  will  now 
describe  the  detailed  algorithms  to  insure  correct  instruction  sequencing.
The  description  we  give  here  will  be  functionally  complete.  It  will  not 
include  the  details  of  what  constitutes  a  complete  instruction  or  the  spe- 
cial circuitry  that  allows  results  to  be  used  as  operands  without  going  to 
and  from  scalar  memory.  This  will  be  discussed  in  Section  4.4.4  where  we 
describe  the  scalar  switch. 

Only those logical scalar addresses that have recently occurred as results
have time indexes associated with them. All operands occurring in the queues
must have a use count associated with them to ensure that they will not be
overwritten. Thus, there will be two tables associated with the scalar
buffer that provide use counts. The first of these will have one entry for
each time index. In addition to the physical scalar buffer address and use
count, each entry contains a link and status information. The status
information indicates if this is the oldest or youngest reference to that
location in this table and also indicates if the corresponding logical
operand is now available in the scalar buffer. The link points to the next
younger reference to the same physical location. The second table, used for
operands without a time index, contains a use count and the physical address
that the use count is for.

These tables must be accessed before any instruction enters the queues,
whenever an instruction is being checked for being ready to execute, whenever
any operand is accessed, and finally, whenever any result is stored. We will
now describe the algorithms required. When an instruction enters the queue,
the use count for its operands must be incremented and entries made for its
result. If those operands have a time index, this may be interpreted as an
address to the first status table, and the associated use count must be
incremented. If there is no time index, then an associative access of the
second status table is required. If an entry is found, then its use count
must be incremented. If no entry is found, then a new entry must be
constructed with a use count of one. Before an instruction enters the scalar
queues, a new entry for the scalar result must be made in the first scalar
status table, and links within the table must be updated. An associative
access of the table is required to determine the most recent entry referring
to the same physical location. The link in this location must be set to point
to the new entry, and the status bits set to indicate that this is no longer
the youngest table entry. The new entry has its link field cleared and its
status set as being the youngest entry. In addition, the status is set to
indicate this value is not yet available in the scalar buffer, and if there
was no other entry referring to this physical location, its status is also
set as the oldest entry referring to this location.
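
The bookkeeping performed when an instruction enters the queues can be
sketched in a few lines of Python. This is only an illustrative model under
stated assumptions, not the hardware design: the names StatusTables,
TimeEntry, and enqueue are hypothetical, dictionaries stand in for the
associative memories, and None stands in for a cleared link.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class TimeEntry:
    """One row of the first status table (one row per time index)."""
    phys: int                     # physical scalar buffer address
    use_count: int = 0
    link: Optional[int] = None    # time index of the next younger entry
    oldest: bool = False
    youngest: bool = True
    available: bool = False       # value not yet in the scalar buffer


class StatusTables:
    def __init__(self) -> None:
        self.first: Dict[int, TimeEntry] = {}   # time index -> entry
        self.second: Dict[int, int] = {}        # physical address -> use count

    def _youngest_for(self, phys: int) -> Optional[int]:
        # Stand-in for the associative search of the first table.
        for time_index, entry in self.first.items():
            if entry.phys == phys and entry.youngest:
                return time_index
        return None

    def enqueue(self, operands: List[Tuple[int, Optional[int]]],
                result_phys: int, result_time: int) -> None:
        # Count one pending use for every operand of the instruction.
        for phys, time_index in operands:
            if time_index is not None:
                self.first[time_index].use_count += 1
            else:
                self.second[phys] = self.second.get(phys, 0) + 1
        # Make the result entry and link it behind the previous youngest one.
        prev = self._youngest_for(result_phys)
        if prev is not None:
            self.first[prev].link = result_time
            self.first[prev].youngest = False
        self.first[result_time] = TimeEntry(phys=result_phys,
                                            oldest=(prev is None))
```

Calling enqueue twice with the same result address leaves the first entry
linked to the second, with only the second marked youngest.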

We will now discuss the algorithms for updating the tables when an operand
is accessed by the SEUs. If this operand does not have a time index, then the
second table is searched, and the corresponding entry has its use count
decremented. If the use count goes to zero, then the location is cleared and
marked available for reuse. If there is a time index, then the specified
location of the first table has its use count decremented. If this use count
goes to zero, then there are two possible courses of action. If the link of
this entry is non-zero and there is therefore a younger reference to this
physical location, then we know that all new instructions entering the queues
with operands having this physical address will refer to the entry pointed to
by the link or another more recent entry. Thus, this entry in the table can
be cleared and marked as free. In addition, a signal must be sent to the IUD
indicating that this time index is now available to be reused. If there is no
other entry with this physical address, then the entry is left in the table
with a zero use count. This is desirable because an instruction yet to enter
the queues may use this operand. If this entry were cleared, a new entry in
the other scalar status table would be required when the later instruction
was processed.
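
The update made on each operand access can be sketched as follows. Again
this is a hedged illustration rather than the actual logic: access_operand
and TimeEntry are hypothetical names, dictionaries replace the tables, None
replaces a zero link, and the list freed stands in for the signal sent to the
IUD.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TimeEntry:
    phys: int                     # physical scalar buffer address
    use_count: int
    link: Optional[int] = None    # next younger entry for the same address


def access_operand(first: Dict[int, TimeEntry], second: Dict[int, int],
                   phys: int, time_index: Optional[int],
                   freed: List[int]) -> None:
    """Record one operand read by an SEU and release entries where allowed."""
    if time_index is None:
        # Operand without a time index: it is tracked in the second table.
        second[phys] -= 1
        if second[phys] == 0:
            del second[phys]              # location cleared for reuse
        return
    entry = first[time_index]
    entry.use_count -= 1
    if entry.use_count == 0 and entry.link is not None:
        # A younger entry exists, so no new instruction can name this one.
        del first[time_index]
        freed.append(time_index)          # time index may be reused
    # Otherwise the entry stays, possibly with a zero use count, so that a
    # later instruction can still refer to it.
```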

Since we do not necessarily clear the first table when a use count goes to
zero in it, additional hardware is necessary to ensure that new entries do
not overwrite needed information. In addition, this hardware can make sure
that no problems result from the limited number of time indexes. The
algorithm for doing this is to make sure at successive points in time that
all the results that might be processed during the next interval have space
available for them. Thus, the hardware must continually test and if possible
clear a block of locations. If any of these have a non-zero link, they
cannot be cleared, and the IUD must be given a signal to wait. In addition,
if the second table becomes full, the IUD will also be required to wait.

The design techniques to construct hardware for the algorithms described in
this section are the same as those employed in constructing the IUD. In a
sense, this hardware may be regarded as an extension of the IUD that resides
physically inside the scalar processing part of the machine. We will refer
to this hardware as the Scalar Instruction Dispatch Subsystem (SIDS). We
will provide a detailed design for this unit in Section 4.6.5 when we
describe the remainder of the IUD.

The one function of these tables that we have yet to discuss is their use in
determining if an instruction is ready to be executed. We must determine if
the operands are available. We need also to determine if the result can go
directly to the Scalar Buffer or if it must go to the Result Buffer. This
problem is further complicated by the fact that there may be up to six SEUs
simultaneously determining which of their queue entries are ready to be
executed. In order to keep the communication between the SEUs and the SIDS
to a manageable level and to simplify the design of both units, we will
provide the SEUs with their own local set of tables. We will further
restrict communication between these units to block transfers of information
occurring once in each major clock. From the previous discussion, we see
that at every major clock the SEUs must provide the SIDS with a list of all
operands accessed in the previous major clock. There will be at most twelve
of these operands. Before we can determine the information that flows in the
other direction, we need to define in detail the function and structure of
the SEU tables.

The  SEU  tables  must  serve  four  functions: 

(1)  Determine  which  instructions  are  ready  to  be  executed. 

(2)  Determine  which  available  operands  are  in  the  Result  Buffer  and 
which  results  can  go  directly  to  the  Scalar  Buffer. 

(3)  Determine  when  a  result  can  be  transferred  from  the  Result  Buffer 
to  the  Scalar  Buffer. 

(4)  Accumulate  the  list  of  operands  needed  by  the  SIDS. 

An instruction can be executed if its operands are available. Operands that
do not have a time index are always available. The SEUs must be provided
with a list of all time indexed operands which are available. They must have
a means of updating this information locally until the information they are
receiving from the SIDS has had time to absorb those particular updates. To
know which time indexed operands are initially available, only a single bit
for each possible time index is required. To know which time indexed operands
have become available requires setting the available bit for all results as
they are computed. Thus, all that is required for the first function listed
is an addressable array of available bits, one for each possible time index.
At this point, we should mention that an operand is also considered
available if a request has been made to send it on the scalar switch
immediately after it has been computed. Thus, we require a special table of
these pending requests.

The second function listed previously concerns itself with determining where
the operands are and where the result goes once the decision to execute a
particular instruction has been made. Possible sources of operands are:
Result Buffer, Scalar Buffer, about to appear on the Scalar Switch, and an
operand or result of the previous instruction inside the SEU requiring it.
This last case is handled entirely by the SEU. Associative memories are
required to determine which operands are in the Result Buffer or are about
to appear on the Scalar Switch. Constructing these tables does not require
any information from the SIDS. Determining which results can go directly to
the Scalar Buffer does require a single bit of information for each possible
time index. This information does not have to be terribly current, but only
updated with sufficient frequency to keep the Result Buffer from becoming
full.

The  third  function  of  transferring  Result  Buffer  values  to  the  Scalar 
Buffer  requires  the  same  one  bit  of  information  for  each  time  index  that 
states  if  that  Scalar  Buffer  location  may  be  overwritten. 

The  last  function  requires  that  all  operand  accesses  by  the  SEUs  be 
recorded  for  periodic  transfer  to  the  SIDS. 
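
The first and fourth functions of the SEU tables can be illustrated with a
small model. This is a sketch under assumptions, not the design itself: the
class SeuTables and its method names are hypothetical, a Python list models
the array of available bits, and a set models the table of pending switch
requests.

```python
from typing import List, Optional, Set, Tuple

Access = Tuple[int, Optional[int]]   # (physical address, time index or None)


class SeuTables:
    """Local SEU view: availability bits plus the access list for the SIDS."""

    def __init__(self, n_time_indexes: int = 256) -> None:
        self.available = [False] * n_time_indexes  # one bit per time index
        self.pending: Set[int] = set()    # results requested on the switch
        self.access_log: List[Access] = []

    def operand_ready(self, time_index: Optional[int]) -> bool:
        if time_index is None:            # no time index: always available
            return True
        return self.available[time_index] or time_index in self.pending

    def instruction_ready(self, operands: List[Optional[int]]) -> bool:
        return all(self.operand_ready(t) for t in operands)

    def result_computed(self, time_index: int) -> None:
        self.available[time_index] = True  # set the bit as the result appears

    def record_access(self, phys: int, time_index: Optional[int]) -> None:
        self.access_log.append((phys, time_index))

    def flush_to_sids(self) -> List[Access]:
        """Hand over the accesses of the last major clock and start afresh."""
        log, self.access_log = self.access_log, []
        return log
```

In this model flush_to_sids would be called once per major clock, matching
the block transfer of accessed operands described earlier.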

4.3.2.2   Scalar  Queues 

The scalar queue we design in this section will be reused with minor
modifications in both the VEUs and the Vector Switch. In all cases, the
queues can be thought of as containing two or three address instructions.
Functionally the queue is scanned from oldest element first until an
instruction with all operands available is encountered. This instruction is
then chosen for execution. The queues only serve as special memories to
contain the instructions. Control sequencers read the queues and make
decisions based on external conditions.

We  will  now  discuss  the  specific  functions  the  queues  must  serve  and 
the  design  we  have  chosen. 

The queue must provide rapid access to its contents in the sequence in which
they were stored. It must allow for the deletion of any of its elements
without affecting the order of the remainder. It must be able to accept
inputs rapidly and at any clock interval. Figure 7 shows the internal
structure of the instruction queue. Instructions enter through the bottom of
the queue and may exit from any point through the test selector. Each
register can be shifted into its neighbor next higher in the queue. Thus, if
the kth element exits the queue, all registers further down in the queue can
be shifted up one place while those higher in the queue remain as they are.
The control bits keep track of which elements are to be shifted and which
element the test selector is selecting, and allow for the migration of a new
entry up to the current logical bottom of the queue. The control bit logic
updates the control bits. Table 3 gives the logical functions that this unit
must compute. Table 4 provides estimates of gate counts and speed for the
instruction queue.
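
The functional behavior of the queue (scan from the oldest entry, remove the
first ready instruction, close the gap without reordering the rest) can be
modeled as below. This sketch captures only the behavior, not the shift
register and control bit implementation; InstructionQueue and its methods
are hypothetical names, and a Python list replaces the register stack.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class InstructionQueue(Generic[T]):
    """Fixed-capacity queue; index 0 is the oldest entry (top of the queue)."""

    def __init__(self, capacity: int = 16) -> None:
        self.capacity = capacity
        self.cells: List[T] = []

    def push(self, item: T) -> bool:
        if len(self.cells) == self.capacity:
            return False                 # queue full; the sender must wait
        self.cells.append(item)          # settles at the logical bottom
        return True

    def pop_first_ready(self, ready: Callable[[T], bool]) -> Optional[T]:
        # Scan from the oldest element until a ready instruction is found.
        for k, item in enumerate(self.cells):
            if ready(item):
                # Deleting cell k moves every younger entry up one place and
                # leaves the older entries where they are.
                del self.cells[k]
                return item
        return None
```

The predicate passed to pop_first_ready plays the role of the sequencer's
test on each entry output by the test selector.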

4.3.2.3   SEU  Sequence  Controller 

The  sequence  controller  must  examine  queue  entries,  determine  which 
instructions  have  operands  available,  and  on  that  basis  set  up  instructions 
for  processing.  We  will  describe  in  detail  the  algorithms  involved  and  the 
TABLE 3   INSTRUCTION QUEUE OPERATION

Abbreviation   Description       Explanation

Control Bits
  Aq           Add to queue      This bit is set when a new element enters
                                 the bottom of the queue. As the element
                                 migrates up the queue, this bit moves along
                                 until the bottom of the queue is reached.
  Bq           Bottom of queue   Designates the highest cell in the queue
                                 which is not currently occupied.
  We           Word examined     Set for the next queue entry to go out.
  Wp           Word to pop       Set for the element that can be currently
                                 popped from queue and for all higher
                                 elements.

Instructions from Sequencer
  Ne           Next element      Output next element for the sequencer to
                                 test.
  Pq           Pop queue         Pop designated element out of queue.

Instructions from Queue
  Re           Reset We
  Rp           Reset Wp

Actions of Queue Element
  S            Shift output      Shift contents of register to next queue
                                 element.
  E            Element output    Output register to the sequencer.

The following conventions are used in the description of queue operations:

  S(Aq)        Set control bit Aq.
  R(Aq)        Reset control bit Aq.
  [illegible]  Indicates Aq is set in the next lowest queue element.
  [illegible]  Indicates Aq is reset.
TABLE 3   INSTRUCTION QUEUE OPERATION (cont.)

Event         Action         Explanation

[illegible]   S  R(Aq)       Advance new entry up one position.
[illegible]   S(Aq)          Receive new entry and set control bit.
[illegible]   R(Bq)          New entry reaches bottom of queue.
[illegible]   S(Bq)          New entry reaches bottom of queue. New bottom
                             is marked.
[illegible]   E  R(We)       Output queue entry.
[illegible]   S(We)          Mark next entry to be output.
[illegible]   S(Wp)          Advance word to be popped marker.
[illegible]   S(Re)          Word being examined has reached bottom of
                             queue.
[illegible]   S(Rp)          Word to be popped has reached bottom of queue.
Re            R(We)          Clear We. (The top of queue will have We set.)
Rp            R(Wp)          Clear Wp. (The top of queue will have Wp set.)
Pq            R(We)  R(Wp)   Pop queue. (The top of queue will have
                             W[subscript illegible] set.)
[illegible]   S              Move elements below element popped up one in
                             queue.
[illegible]   R(Bq)          Move bottom of queue up one.
[illegible]   S(Bq)          Move bottom of queue up one.
TABLE 4   GATE COUNT FOR INSTRUCTION QUEUE

Symbols

Description                              Symbol   Sample   Explanation
                                                  Value
Number of gates to shift one bit         Ns          2     From Table 3
Number of gates in control bit logic     CL         25     From Table 3
Number of bits in shift control logic    SL          3     From Table 3
Gates to store one bit                   Gm          4
Number of bits per register              Nb         60     Must hold up to
                                                           three data buffer
                                                           addresses and an
                                                           operation code
Number of registers                      Nr         16     Queue length
Number of control bits                   Nc          4     From Table 3

Gate Estimates

Functional Unit       Formula        Sample Value
Shift Control         SL                    3
Control Bit Logic     CL                   25
Test Selector         [illegible]          64
Control Bits          Gm Nc                16
Register              Gm Nb               240
Subtotal                                  348

Multiply by Nr to get total for queue = 5568
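
The Table 4 estimate can be checked with a short calculation. This is a
hedged sketch: the function name queue_gate_count is hypothetical, the symbol
Nc for the number of control bits is a reading of a poorly legible entry, and
the test selector figure of 64 is taken as printed because its formula cannot
be recovered.

```python
def queue_gate_count(sl: int = 3, cl: int = 25, test_selector: int = 64,
                     gm: int = 4, nb: int = 60, nc: int = 4, nr: int = 16):
    """Return (gates per queue element, gates for the whole queue)."""
    control_bits = gm * nc        # storage for the control bits
    register = gm * nb            # storage for one instruction register
    per_element = sl + cl + test_selector + control_bits + register
    return per_element, per_element * nr


per_element, total = queue_gate_count()
print(per_element, total)
```

With the sample values of the table this gives 348 gates per queue element
and 5568 gates for a 16 element queue, in agreement with the printed totals.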
status tables required. We will explain how the algorithms can be
implemented within the required time constraints and provide gate estimates.
We will not do detailed logical design for these algorithms.

The principal complication of this unit is the variety of possible sources
of operands. Most of these sources are redundant in the sense that their
only purpose is to allow for rapid processing. Except for timing
considerations, the same operands would be available from other sources.
Since only extensive experimentation could provide accurate estimates of the
cost-benefit tradeoffs, we do not claim that the ideal design would
incorporate all these features. We include them as suggestions and state
what advantages they seem to provide.

The scalar buffer is the primary source of operands. All operands without
time indexes are in the scalar buffer. In addition, a table provides a list
of what time indexed operands are in the scalar buffer. It is logically
possible to eliminate all sources of operands other than the scalar buffer.

The result buffer allows for the existence of multiple occurrences of the
same physical scalar buffer address. In addition, it eliminates the
substantial delay between the time all accesses to a particular physical
scalar address have completed and the time the SEUs become aware of that
fact and are able to overwrite that physical address with a new result. It
is fairly certain that this latter function of the result buffer is
essential to providing a reasonable throughput for the scalar execution
units. With it all instructions can be processed as soon as their operands
are available. The result goes to the result buffer unless the SEUs know the
physical address of the result can be overwritten.
It is less clear to what degree the remaining two sources of operands we
will now discuss are important for efficient utilization of the SEU. Both of
them will help in lessening the load on the scalar switch and in some
instances providing for more rapid instruction processing. First we consider
the case where a request has been made to transfer an operand on the scalar
switch. If the same operand is required for a queued instruction, a request
can be entered that the operand also be transferred to the SEU that will be
assigned the queued instruction. This is the only method the 360/91 uses in
sequencing its various arithmetic units. The other case occurs when a result
being computed is required for a subsequent instruction. The controller can
be aware of this and can simply transfer the result to an operand buffer
within the SEU. This is likely to be a desirable feature since it is very
common to have the result of one operation be required by the next. A final
possibility would be to allow one to reuse the operands of one instruction
for the next. We do not include this alternative because of the switch
required within each SEU for non-symmetric operations and because it is a
less likely occurrence.

We  will  now  describe  the  tables  required  to  keep  track  of  all  these 
sources  of  operands.  The  scalar  buffer  requires  only  a  single  bit  for  each 
time  index  to  indicate  if  that  operand  is  present.  The  same  is  true  of  the 
result  buffer.  These  bits  are  set  as  results  are  returned  to  the  specified 
buffers  and  reset  as  the  time  indexes  are  recycled.  There  will  exist  short 
queues  to  drive  the  entries  to  the  scalar  switch.  By  allowing  associative 
reads  of  these  queues,  we  can  determine  if  an  operand  is  about  to  appear  in 
the  scalar  switch.  The  final  source  of  operands  is  results  in  the  process 
of  being  computed.  Again,  an  associative  memory  is  required.  Figure  8 
FIGURE 8   ALGORITHMS FOR ACCESSING SCALAR STATUS TABLES

[Flowchart in three panels: OPERAND 1, OPERAND 2 (both operands are done
simultaneously), and SEU ALLOCATION. Legible box labels for operand 1:
RESET OP1 FLAG; SET OP1, OP1B FLAGS; SET OP1, OP1R FLAG; SET OP1, OP1C FLAG;
SET OP1, OP1S FLAGS, ADD TO SWITCH QUEUE, REQUEST TO GO TO SELECTED SEU. The
operand 2 panel carries the same labels with OP2. Legible box labels for SEU
allocation: ALLOCATE SEU CONTAINING OP1, SET OP1U; ALLOCATE SEU CONTAINING
OP2, SET OP2U; ALLOCATE NEXT AVAILABLE SEU; GET NEXT QUEUE ENTRY; TEST
OPERANDS PRESENT; RESET QUEUE. Decision branches are labeled NONE, PHYSICAL,
TIME INDEX, YES, and NO.]
TABLE 5   SCALAR UNIT STATUS TABLES SPECIFICATIONS

I    Scalar Buffer Table

     Size                256 entries
     Fields              1 bit to indicate the presence of each entry
     Parallel Accesses   6 reads
                         2 writes
     Gate Count          4*6*256 = 6144

II   Result Buffer Table

     Same as for scalar buffer table

III  Pending Requests to Use Scalar Switch

     This table will be described in Section 4.4.3

IV   Results Being Computed

     Size                One entry for each SEU or a total of 6
     Fields              Destination address (12 bits)
                         Time index (8 bits)
                         1 bit indicating use result buffer or scalar buffer
                         6 bits indicating other SEUs requesting result
     Parallel Accesses   6 associative searches of time index
                         6 stores to SEU result request bit (each SEU has
                         its own bit)
                         1 initial store for all fields
     Gate Count          (6*12*8 + 6*4 + 27)*6 = 3762
describes the detailed algorithms for accessing these tables. Table 5
provides detailed specifications for these tables including gate counts for
the tables and estimated gate counts for implementing the accessing
algorithms.
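
The lookup across the operand sources can be sketched as a simple chain of
tests. This is an illustration under assumptions and does not reproduce the
flowchart of Figure 8: the names Source and locate_operand are hypothetical,
the order in which the sources are tested is assumed, and Python lists and
sets replace the bit tables and associative memories.

```python
from enum import Enum
from typing import List, Optional, Set


class Source(Enum):
    SCALAR_BUFFER = "scalar buffer"
    RESULT_BUFFER = "result buffer"
    SWITCH_QUEUE = "about to appear on the scalar switch"
    BEING_COMPUTED = "result being computed"
    NOT_AVAILABLE = "not available"


def locate_operand(time_index: Optional[int],
                   in_scalar_buffer: List[bool],
                   in_result_buffer: List[bool],
                   switch_queue: Set[int],
                   being_computed: Set[int]) -> Source:
    """Report where an operand can be obtained, if anywhere."""
    if time_index is None:
        return Source.SCALAR_BUFFER          # no time index: always there
    if in_scalar_buffer[time_index]:         # one bit per time index
        return Source.SCALAR_BUFFER
    if in_result_buffer[time_index]:         # one bit per time index
        return Source.RESULT_BUFFER
    if time_index in switch_queue:           # associative read of the queues
        return Source.SWITCH_QUEUE
    if time_index in being_computed:         # associative search, 6 entries
        return Source.BEING_COMPUTED
    return Source.NOT_AVAILABLE
```

An instruction whose operands all resolve to something other than
NOT_AVAILABLE would be a candidate for execution in this model.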

4.3.3   Scalar  Execution  Unit  Buffers 

In this section we will discuss the scalar buffer, result buffer, and
vector-scalar buffer shown in Figure 5. We will describe their internal
structure, their external connections, and conflict resolution. The memories
are organized into independently accessible modules. The number of these is
determined by the maximum data rate at which the memories can operate. This
in turn is determined by the data rates of the SEUs. Table 6 provides this
analysis for the three buffers. The connections to the outside world include
data paths to the scalar switch and other units as well as switches between
the memories and these data paths. Table 6 lists the size of the required
switches. Conflicts may arise when there are simultaneous requests to read
and write to the same memory module from the scalar switch. In addition,
conflicts may arise between the scalar switch and other units requesting
access to the same memory module. Multiple requests for the same memory
module by the scalar switch are resolved from within the switch. All other
conflicts are handled by a simple rotating priority scheme. One of the
requests is given priority and honored; the others must wait until they are
given priority. The priority shifts between units in such a way that they
all are given priority once before any receives it twice. A detailed design
of this type of priority logic for a more complex case will be given in
Section 4.4.2.4. Table 6 provides a gate count for all the logic discussed.
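
A rotating priority scheme of the kind described can be sketched as follows.
This is a minimal model under assumptions, not the logic of Section 4.4.2.4:
RotatingPriority and grant are hypothetical names, the priority position is
assumed to advance by one unit every cycle, and when the unit holding
priority has no request the next requesting unit in rotation order is
assumed to be served.

```python
from typing import Iterable, Optional


class RotatingPriority:
    def __init__(self, n_units: int) -> None:
        self.n_units = n_units
        self.first = 0               # unit that holds top priority this cycle

    def grant(self, requests: Iterable[int]) -> Optional[int]:
        """Honor at most one request, then pass top priority to the next unit."""
        wanted = set(requests)
        winner = None
        for offset in range(self.n_units):
            unit = (self.first + offset) % self.n_units
            if unit in wanted:
                winner = unit
                break
        # Every unit holds top priority once before any holds it twice.
        self.first = (self.first + 1) % self.n_units
        return winner
```

With three units that all request continuously, successive calls grant units
0, 1, 2, and then 0 again, so no unit waits more than two cycles.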
TABLE 6   DESIGN PARAMETERS AND GATE COUNTS FOR SCALAR BUFFERS

Buffer                 Communicates With   Max Output Rate*   Max Input Rate*

Scalar Buffer          6 SEUs                    12                 6**
                       Result Buffer              0                 6**
                       Vector Switch              0                 8
                       TOTALS                    12                14

Result Buffer          SEUs                      12                 6
                       Scalar Buffer              6                 0
                       TOTALS                    18                 6

Vector Scalar Buffer   SEUs                      12                 6
                       VEUs                       6                 6
                       TOTALS                    18                12

 *Data rates in words per major clock.
**The sum of these must be <= 6.

The above data rates are based on the assumption of 6 SEUs operating at full
capacity. The maximum rates are in general the maximum possible rates for an
individual unit and not all maximum total rates could be maintained
simultaneously. In converting these rates to actual access rates, we take
full advantage of the fact that these units are 8-word parallel buffers.
Conflicts keep this assumption from being totally correct, but given the
highly queued nature of the design and the initial remarks in this note, the
assumption seems reasonable.




TABLE 6   DESIGN PARAMETERS AND GATE COUNTS FOR SCALAR BUFFERS (cont.)

SCALAR UNIT BUFFER SPECIFICATIONS

Buffer           Size    Read/Write    Rate per     Access Time   Gate Count
                         Rate per      Mod per      in Minor      (64 bit
                         Major Clock   Minor Clock  Clocks        word)

Scalar Buffer    8*256       26          0.41           2          524,288
Result Buffer    8*32        24          0.38           2           65,536
Vector-Scalar    8*32        30          0.47           2           65,536
Buffer
TOTAL                                                              655,360

MEMORY SWITCH SIZES

Buffer                  Read    Write

Scalar Buffer           8x2     8x2
Result Buffer           8x3     8x1
Vector Scalar Buffer    8x2     8x1
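
The derived columns of Table 6 can be reproduced with a short calculation.
This is a hedged sketch: the divisor of 8 minor clocks per major clock is an
assumption inferred from the printed rates (26/64 = 0.41), the 4 gates per
stored bit is taken from Gm in Table 4, and the names BUFFERS and summarize
are hypothetical.

```python
MODULES = 8            # each buffer is 8 words wide
MINOR_PER_MAJOR = 8    # assumption inferred from the printed rates
GATES_PER_BIT = 4      # Gm from Table 4
WORD_BITS = 64

# Total output and input rates (words per major clock) from Table 6.
BUFFERS = {
    "Scalar Buffer":        {"words": 8 * 256, "out": 12, "in": 14},
    "Result Buffer":        {"words": 8 * 32,  "out": 18, "in": 6},
    "Vector-Scalar Buffer": {"words": 8 * 32,  "out": 18, "in": 12},
}


def summarize(name: str):
    """Return (rate per major clock, rate per mod per minor clock, gates)."""
    b = BUFFERS[name]
    rate = b["out"] + b["in"]
    per_mod = round(rate / (MODULES * MINOR_PER_MAJOR), 2)
    gates = b["words"] * WORD_BITS * GATES_PER_BIT
    return rate, per_mod, gates


for name in BUFFERS:
    print(name, summarize(name))
```

Under these assumptions the computed rates and gate counts match the printed
entries, and the three gate counts sum to the printed total of 655,360.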
4.3.4   Scalar Switch

The Scalar Switch transmits data between the buffers and execution units of
the scalar portion of the Computation Unit. We have discussed its functional
operation in the preceding sections. In this section we provide a detailed
design. Table 7 summarizes the data and instruction paths of the switch.
Figure 9 gives the structure of a representative portion of the switch. In
discussing the SEU sequence controller, we did not specify how many SEUs
each controller drives. The scalar switch ties together all the previously
discussed scalar components and thus at least for the purpose of providing
gate estimates, we need to assume some realistic configuration. Thus, in
this section we have assumed three sequence controllers driving six SEUs.
This would allow for three independent types of Scalar Execution Units.

The principal complexity of the switch design results from possible
conflicts. Such problems may arise both from the instruction switch and data
switch. Conflicts are resolved by priority logic like that mentioned in the
previous section and discussed in detail in Section 4.4.2.4. Conflicts can
occur for any of three reasons. All requests for accessing data originate
from the queues. These requests must enter "source associated" queues. If
more than one attempt at a time is made to make an entry, a conflict
results. Another source of conflicts is the simultaneous attempt to access
the same memory mod. The final source of conflicts is the limited number of
ports into any unit. There may be too many simultaneous requests to use
these ports. Resolving all these conflicts involves two basic principles.
First, once a request is made, the requesting unit waits until it receives
confirmation. The priority hardware mentioned above assures that this will
always happen fairly soon. The second principle is that requests are always
made and honored for the earliest time at which the requesting unit is
capable of honoring them. In other words, the requesting process is
pipelined with the hardware that executes requests.

Two of the above mentioned conflicts are interdependent. Both a memory mod
and a switch port must be reserved in accessing any of the buffers. This is
handled by not requesting a memory mod until a port has been reserved. It
also requires an additional minor clock of pipelining in the requests for
ports to ensure that they can be used at full capacity. Table 8 provides
gate estimates for this logic.

We  add  a  final  note  that  all  this  pipelining  and  requesting  circuitry 
is  not  likely  to  create  problems  in  throughput.  This  is  because  the  SEUs 
only  process  one  instruction  per  major  clock,  whereas  all  this  conflict 
resolution  occurs  at  the  minor  clock  rate.  In  addition  the  input  and  output 
of  each  SEU  is  buffered.  Thus  there  should  be  adequate  slack  as  long  as  the 
data  rates  can  be  maintained.  The  most  likely  source  of  trouble  would  be 
poor  distribution  of  data  across  the  buffers.  If  this  proved  to  be  a  problem 
it  could  be  alleviated  by  doubling  the  memory  speed  to  a  1  minor  clock  cycle. 
TABLE 7   DATA AND INSTRUCTION PORTS FOR THE SCALAR SWITCH

DATA PORTS

Unit                    Input Ports   Output Ports

6 SEUs                       6              6
Scalar Buffer                2              2
Result Buffer                1              3
Vector-Scalar Buffer         1              2
TOTALS                      10             13

See Table    for the source of these figures

INSTRUCTION PORTS

There are three SEU queues, each of which has a path to all three buffers,
TABLE 8   GATE COUNT FOR SCALAR SWITCH

Unit                          Number   Gate       Source of Estimate    Total
                                       Estimate

Source Queue (examines 2         2      2 400     4 entries in queue,   4 800
  entries simultaneously)                         Section 4.3.2.1
Source 3x2 Switch                2        480     20 bit instructions     960
Source Queue Control             5      1 000     Figure                5 000
Source Queue                     1      1 200     4 entries in queue,   1 200
                                                  Section 4.3.2.1
Source 3x1 Switch                1        240     20 bit instructions     240

Local Data Switches:
  SEU 1x2                        6        256     64 bit word           1 536
  Scalar and Result              2     10 240     64 bit word          20 480
    Buffers 8x5
  Vector-Scalar Buffer 8x3       1      6 144     64 bit word           6 144
Local Switch Controls            3      3 000     Figure                9 000
Scalar Switch 10x13              1     39 520     64 bit words,        39 520
                                                  12 bit addresses

TOTAL                                                                  88 880
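
The Table 8 total can be checked by summing number times gate estimate over
the rows. This is a hedged sketch: the names ROWS and crossbar_gates are
hypothetical, and the crossbar formula (inputs x outputs x bits x 4 gates) is
only an inference that happens to match several printed entries; it does not
match the SEU 1x2 entry, so it should not be read as the author's stated
rule.

```python
# (unit, number of units, gate estimate per unit), as read from Table 8.
ROWS = [
    ("Source Queue (examines 2 entries)", 2, 2400),
    ("Source 3x2 Switch", 2, 480),
    ("Source Queue Control", 5, 1000),
    ("Source Queue", 1, 1200),
    ("Source 3x1 Switch", 1, 240),
    ("SEU 1x2 local data switch", 6, 256),
    ("Scalar and Result Buffers 8x5", 2, 10240),
    ("Vector-Scalar Buffer 8x3", 1, 6144),
    ("Local Switch Controls", 3, 3000),
    ("Scalar Switch 10x13", 1, 39520),
]

TOTAL = sum(number * gates for _, number, gates in ROWS)


def crossbar_gates(inputs: int, outputs: int, bits: int,
                   gates_per_bit: int = 4) -> int:
    """Inferred estimate: one set of gates per crosspoint per bit."""
    return inputs * outputs * bits * gates_per_bit


print(TOTAL)
print(crossbar_gates(10, 13, 64 + 12))   # 64 bit words plus 12 bit addresses
```

The row products sum to the printed total of 88 880, and the inferred
crossbar formula gives 39 520 for the 10x13 switch.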
4.4   VECTOR PORTION OF COMPUTATION UNIT

The vector portion of the execution unit is intended to do the bulk of the
actual processing. The primary purpose of the scalar unit just discussed is
to avoid having to use full vector processors when these are not required.
The justification for having the vector units is a combination of utility
and economy. We know that most FORTRAN programs can effectively utilize
vector units of at least width 8. We have already observed in Section 4.3
the substantial overhead that is involved with the queue driven and
pipelined approach. By allowing each instruction to drive the equivalent of
8 parallel execution units, we minimize the cost of this overhead. The
tradeoff in determining how wide such units should be is overhead cost
versus utilization of the potential parallelism. As discussed in Section
3.3, we consider the whole area of parallel computing to be in a very
primitive state. Thus, we justify the width we have chosen solely on the
grounds that we know it will work for a very broad class of problems and we
can implement it with an acceptable level of overhead. We do not wish to
enter into the extraordinarily complex question of quantifying the
tradeoffs. We will now discuss the overall structure of the vector unit.

4.4.1   Overall  Structure  of  Vector  Portion  of  Computation  Unit 

Those portions of Figure 2 that constitute the vector unit are the
VEUs, the Vector Switch, and the Vector Buffer. The VEUs perform the actual
vector processing and are fairly complex units containing instruction queues,
buffers, and other hardware in addition to the hardware that does the actual
computation. The Vector Buffer acts as a back-up reserve storage for the
buffers within the VEUs. The Vector Switch is responsible for transferring
data among these units and between them and the memory. In addition, it can
transmit data to the Scalar Buffer and to the MIDs. It also contains its
own internal queues. We will now describe each of these units in detail.

4.4.2   Vector  Execution  Units 

Figure 10 gives the overall structure of a typical VEU. The unit is
controlled by the sequencer, which reads instructions from the instruction
queue. The sequencer tests successive entries in the queue until one is
encountered with all operands present in the operand buffer. The sequencer
will then set up this instruction to commence execution as soon as the
current instruction is finished. The access controllers resolve any conflicts
that may occur in accessing the buffers. The hardware associated with the
internal switch allows for results to be used as operands without going
through the Vector Switch. We will discuss all the units of Figure 10 in
more detail in subsequent sections. We will first discuss the various
possibilities for the computation hardware.

4.4.2.1   Standard  Arithmetic  Units 

The computation hardware may be a standard arithmetic unit. In other
words, it may perform standard floating point and fixed point arithmetic and
logical operations. We will not discuss the logic to do arithmetic or
similar operations. There are a number of ways in which the parallel units
can be organized. We will discuss several of these alternatives and their
advantages and limitations.

The simplest structure would be an ILLIAC IV type of parallelism. That
is, eight units driven by a single control sequencer such that the units all
perform identical operations. An alternative method of implementing the
same logical structure would be an eight stage pipeline. Since data transfers
are pipelined within the unit as we have discussed, a pipelined arithmetic
unit would fit in quite nicely. In the case of parallel units, we
would have to phase them or buffer them to accommodate the pipelined data
transfers. The advantages of this type of parallelism are that fewer gates
are required for control purposes and the instructions are relatively
simple. The disadvantage is the lack of flexibility. The statement that most
FORTRAN programs can effectively utilize a parallelism of eight was based on
the assumption that each unit can perform a different arithmetic operation
in parallel.

The  next  level  of  complexity  is  to  provide  eight  arithmetic  units,  each 
capable  of  performing  a  different  operation  in  parallel.  To  control  these 
units,  we  would  require  an  extended  instruction  with  at  least  two  bits  of 
control  information  for  each  arithmetic  unit. 

The next level of complexity is to consider tree processors [6].
This is basically a set of arithmetic units interconnected to form a tree
structure. Given our basic width of eight, we could implement trees with a
base of four pairs of operands or eight pairs of operands. In the first
case, the instructions would be regarded as having a single 8-word wide
operand. In the latter case, we would have two 8-word wide operands. The
advantage of a tree structure is that it allows very efficient execution of
an arithmetic expression. Two disadvantages are that most arithmetic
expressions are not full trees, and the unit must be pipelined to operate at
full efficiency. Perhaps the most flexible unit would be one that allowed four
pairs of operands at the base and could also be used to operate on pairs of
eight wide vectors. One advantage of our approach is that various
combinations of these alternatives could be tried out after a machine had
been constructed. There is no need to go into a detailed and abstract
analysis of these alternatives at this stage.
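To make the tree alternative concrete, the following Python sketch evaluates a full tree over a base of operand pairs: the base level applies one operation to each pair, and each higher level combines adjacent results until a single scalar remains. The function name `tree_evaluate` and the choice of multiply at the base and add above it are illustrative assumptions, not part of the design described here.

```python
import operator


def tree_evaluate(pairs, base_op=operator.mul, combine_op=operator.add):
    """Evaluate a full binary tree over a list of operand pairs.

    The base level applies base_op to every pair; each higher level
    applies combine_op to adjacent results until one scalar remains.
    The number of pairs must be a power of two (4 or 8 in the text).
    """
    if not pairs or len(pairs) & (len(pairs) - 1):
        raise ValueError("number of operand pairs must be a power of two")
    level = [base_op(a, b) for a, b in pairs]
    while len(level) > 1:
        level = [combine_op(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


# Four pairs at the base: a single 8-word operand viewed as four pairs.
four_pair_result = tree_evaluate([(1, 2), (3, 4), (5, 6), (7, 8)])
```

With eight pairs at the base the same routine takes the two 8-word operands element by element, for example `tree_evaluate(list(zip(a, b)))` for two 8-element lists `a` and `b`.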

There is an important structural difference between vector and tree
units that we must consider at this stage: tree units have a scalar output
and vector units a vector output. There are three possible destinations for
such a scalar. These are the scalar portion of the Computation Unit, the MID,
and a vector operand, of which the scalar becomes one element. In this
latter case, the vector may be used in full vector computations and/or be
used in more tree computations. We need to include hardware to accommodate
these possibilities. We will now discuss each of these alternatives.

We have already mentioned in Section 4.3 how some scalar buffer addresses
refer to data from outside the scalar buffer. Results of vector instructions
that have the scalar unit as destination need to be sent to the
above-mentioned portion of the scalar buffer. The logic for handling conflict
resolution for multiple VEUs will be in the scalar unit as discussed in
Section 4.3.3.3. The VEU need only interpret this destination from the queue
instruction and send the data and its destination address out over the
appropriate path.

The  same  procedure  can  be  followed  in  the  case  of  data  headed  for  the 
MID.  We  do  not  do  a  detailed  logical  design  of  the  MID,  but  the  techniques 
discussed  in  Section  3.2.1.3  can  be  used  to  handle  conflict  resolution. 

The final alternative we have to consider is that of scalars that are
to become part of a vector. We require special hardware within the vector
portion of the computation unit for this case. This must include a scalar
switch to transfer the data to the correct position in the destination vector
and controls that are able to recognize when a complete vector has been
assembled and is available for further processing. In the case of a single
tree unit, it would probably be desirable to have this hardware within that
unit. With multiple units we would probably want this hardware as part of
the vector buffer. Figure 11 describes this logic and Table 9 provides
gate estimations.


FIGURE 11   ASSEMBLING SCALARS INTO A VECTOR

[Block diagram. Labeled elements: SCALAR SOURCE and address, 8x1 SWITCH,
VECTOR BUFFER, VECTOR OUTPUT PATH, CONTROL, VECTOR ELEMENT PRESENCE BITS,
VECTOR PRESENCE BITS. Note on figure: When all 8 Vector Element Presence
Bits are set for a single vector, then the corresponding Vector
Presence Bit is set.]
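The presence-bit logic of Figure 11 can be sketched in a few lines of Python. The class below is a behavioral model only: the name `ScalarAssembler`, its methods, and the clearing of presence bits when a completed vector is read out are assumptions made for illustration. The width of 8 and the buffer of 8 vectors follow Figure 11 and Table 9.

```python
VECTOR_WIDTH = 8  # elements per vector, as in the text


class ScalarAssembler:
    """Collects scalars into vectors and tracks presence bits."""

    def __init__(self, num_vectors=8):
        self.buffer = [[None] * VECTOR_WIDTH for _ in range(num_vectors)]
        self.element_present = [[False] * VECTOR_WIDTH
                                for _ in range(num_vectors)]
        self.vector_present = [False] * num_vectors

    def store_scalar(self, vector, element, value):
        """Route one scalar to (vector, element).

        Returns True once all 8 element presence bits of that vector
        are set, i.e. once its vector presence bit is set.
        """
        self.buffer[vector][element] = value
        self.element_present[vector][element] = True
        if all(self.element_present[vector]):
            self.vector_present[vector] = True
        return self.vector_present[vector]

    def take_vector(self, vector):
        """Read out a completed vector and clear its presence bits.

        Clearing on read-out is an assumption of this sketch.
        """
        if not self.vector_present[vector]:
            raise RuntimeError("vector is not yet complete")
        data = list(self.buffer[vector])
        self.element_present[vector] = [False] * VECTOR_WIDTH
        self.vector_present[vector] = False
        return data
```

A vector becomes available to `take_vector` only after its eighth element has been stored, regardless of the order in which the scalars arrive.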

Another type of tree we might wish to include is the if tree analyzer
[3]. This would be especially helpful in reducing the amount of
non-determinism in a program. Our highly pipelined structure makes this
especially desirable. Including such a feature is also necessary to obtain
the theoretical speed of FORTRAN programs we have discussed earlier.
Functionally, the if tree analyzer is no different than the trees we have
discussed except that it has a single output that goes to the MID.


TABLE 9   GATE COUNT FOR SCALAR ASSEMBLING UNIT

Unit                              Gate Estimate

8 x 1 Switch                           2 200
Vector Buffer (8 vectors)             20 000
Vector Element Presence Bits             600
Control                                2 000
Vector Presence Bits                     200

TOTAL                                 25 000
4.4.2.2  Vector Routers

At least one of the VEUs will be devoted to a vector router or full
crossbar switch. Given the relatively small size of our vectors and the
broad spectrum of permutations that various algorithms may require, a full
crossbar switch is justified. In addition to allowing the arbitrary
permutations of a vector, this unit should allow for the combining of two
operands under mode control. It would also be desirable to allow selective
partial broadcasting. Mode bits and routing patterns may either be included
as part of the instruction or be dynamically computed within the EUs.
Arbitrary broadcasting patterns contain too many bits to be part of the
instruction. At least in the case of modes and possibly in other cases, we
would wish to allow scalar operands. Thus, we have the inverse case of that
discussed in the previous section. We need to obtain a scalar from the
scalar portion of the computation unit. This can be accomplished by issuing a
scalar instruction to store the required operand in that portion of the
Scalar Buffer that allows for transmission to the VEUs. Some of those
addresses may be regarded as operands within the VEU, and a store to them
results in the transfer.

4.4.2.3  Other  Vector  Units 

There is no need to limit the computation hardware to the alternatives
just discussed. Even after the machine has been constructed, different
sorts of units could be added or used to replace existing units. In the
next section we will discuss in detail the internal queues, switches, and
control for a single VEU. All of this hardware could be used "as is" for any
type of VEU. Not all of it will necessarily be included in every VEU.
The point is that at any time we could add specialized hardware without
designing more than that hardware and some very simple interfaces.

4.4.2.4   Detailed  Internal  Structure  of  a  VEU 

In this section we will finish our discussion of the remaining units
of Figure 10. We now list those units requiring elaboration. The
instruction queue is logically the same as that discussed in Section 4.3.2.2.
The operand and result buffers operate in a special phased array fashion which
we will describe in detail. We will provide detailed design for general
purpose access controllers which we have referred to in previous sections.
The boxes associated with the internal switch will require no new specialized
design. We will provide gate estimates for the entire unit at the end of
this section.

We begin our detailed design with the phased array buffers. This
memory is eight modules wide corresponding to our vector width. All accesses
are to a single vector stored in the same relative position across the
memory. The data paths themselves are only one word wide, and so the data
transfer must be pipelined. Once an access has started for one vector in the
memory, we cannot always afford to wait eight clocks before starting a new
access. Thus, we essentially shift the decoded address from one memory
module to the next. We do this in a manner that allows a new address to
enter at any clock. Similarly, we have a switch that allows the data to be
transferred to several places. The addresses for this switch are shifted in
parallel. Figure 12 shows the structure of such a unit. Table 10 describes
its operation.


FIGURE 12   DATA BUFFER

[Block diagram. Labeled elements: DATA PATH ADDRESS, MEMORY ADDRESS,
ADDRESS DECODER (two), SLOTTED ADDRESSING UNIT (two), DATA PATH FAN OUT,
MEMORY.]
TABLE 10   DATA BUFFER OPERATION


Minor Cycle    Event

0              Two addresses simultaneously enter the decoders and are
               decoded.

1              The enable patterns enter the addressing units and cause
               the selected memory location to be switched to the correct
               data path. A new pair of addresses is decoded.

2              The enable pattern in the addressing unit is shifted one
               to the right and a new pattern takes its place. These are
               both used to switch two words to two data paths. A third
               address enters the addressing unit.

3              All enable patterns are shifted one right. A new address
               enters the address decoder and a new enable pattern the
               addressing unit. Three words are accessed.

4              Same as minor cycle 3, but four words are accessed.

5              Same as minor cycle 3, but five words are accessed.

6              Same as minor cycle 3, but six words are accessed.

7              Same as minor cycle 3, but seven words are accessed.

8              Same as minor cycle 3, but eight words are accessed.

9              Same as minor cycle 8, except the right-most address is
               dropped.

etc.           Operation continues in this manner. During any minor
               cycle in which no new addresses are presented, there will
               be a vacant slot that will move to the right in the same
               manner as the enable patterns. Along with the enable
               patterns, there is a single bit which indicates if a read
               or store is being performed.
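The shifting of enable patterns in Table 10 can be modeled as a shift register with one slot per memory module. The Python sketch below is a simplified model under stated assumptions: it tracks a single stream of accesses rather than the paired memory and data path addresses, it uses one decode stage ahead of the addressing unit, and the function name `simulate` is hypothetical. It returns the number of words accessed in each minor cycle.

```python
MODULES = 8  # memory modules, one per vector element


def simulate(new_address_per_cycle):
    """Return the number of words accessed in each minor cycle.

    new_address_per_cycle[c] is an address (any non-None value)
    presented to the decoder in minor cycle c, or None when no new
    address is presented (a vacant slot).
    """
    slots = [None] * MODULES   # enable patterns in the addressing unit
    decoded = None             # output of the decoder stage
    words_accessed = []
    for address in new_address_per_cycle:
        # Shift every enable pattern one module to the right; the
        # right-most pattern is dropped and the decoded one enters.
        slots = [decoded] + slots[:-1]
        decoded = address      # decoded now, enters the unit next cycle
        words_accessed.append(sum(s is not None for s in slots))
    return words_accessed
```

When a new address is presented in every minor cycle, the count rises by one word per cycle until all eight modules are busy and then stays at eight, as in the table. A `None` entry produces a vacant slot that moves right with the enable patterns.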
We now turn to the general problem of designing priority access
controllers. The logic we design must be fast enough to operate in one minor
clock. It must treat all requests equally. It must ensure that each
requesting unit receives top priority once before any unit receives it twice.
Finally, the design should be general enough to accommodate any number of
requesting units up to 32. Actually, no application within this machine
will require that many units, but our design will meet the other constraints
for up to that many units. We will first describe our algorithm for the
case when the number of requesting units is a power of 2. We will then show
how the algorithm can be modified to handle the remaining cases.

In  describing  the  power  of  2  case,  we  will  assume  8  requesting  units. 
It  will  be  obvious  how  to  generalize  to  larger  or  smaller  powers  of  2. 
Functionally,  our  unit  is  presented  with  8  bits,  any  combination  of  which 
may  be  set.  We  must  produce  an  output  of  8  bits,  only  one  of  which  is  set. 
This  bit  must  correspond  to  one  of  the  bits  that  was  originally  set.  Over 
a  period  of  time,  the  selection  process  must  conform  to  the  requirements 
listed  above. 

Physically, the unit consists of three levels, or log base 2 of the
number of bits in the general case. At the first level, there are 4 two-state
devices through which pairs of bits pass. At each level the number of
devices is halved, and the number of bits passing through each device doubles.
All the devices have two states. The bits passing through each device are
divided into two groups. The device will pass on to the next stage a one
bit from only one of these groups. The two groups passing through a single
device form a single group for the next stage. Thus, by induction, each
group has at most one bit set. The choice of which group to pass on is a
function of the state of the device and its input. Each of the two states
of the device indicates a preference for one of the two groups. By
preference, we mean simply that if the preferred group has a one in it, that
one is passed on; otherwise the other group's one is passed on. Of course, if
neither group has a one, then nothing is passed on. Figure 13 gives an
example. It should be clear that only one of the originally set bits can
emerge. By changing the devices' states in an appropriate sequence, we are
able to provide the uniform scheduling required. The lowest level changes
state at every clock. Each higher level takes twice as many clocks to change
state as the next lower level. Thus, every bit position will go through all
eight priority states in every eight clocks. Figure 13 illustrates this.
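The power of 2 selector just described can be sketched behaviorally in Python. In this sketch the state of every device at level j is taken from bit j of a free-running clock counter, so that the lowest level changes state at every clock and each higher level half as often. Driving the states from a binary counter is our assumption about how the sequencing would be realized, and the function name `select` is hypothetical.

```python
def select(requests, clock):
    """Return a one-hot list picking one set bit of `requests`.

    `requests` is a list of 0/1 values whose length is a power of two.
    Every device at level j (0 = lowest) is in the state given by
    bit j of `clock`.
    """
    n = len(requests)
    if n == 0 or n & (n - 1):
        raise ValueError("number of requesting units must be a power of two")
    # Each group carries the index of its single surviving request, if any.
    groups = [i if bit else None for i, bit in enumerate(requests)]
    level = 0
    while len(groups) > 1:
        state = (clock >> level) & 1
        merged = []
        for k in range(0, len(groups), 2):
            # The state names the preferred group of the pair.
            preferred, other = groups[k + state], groups[k + 1 - state]
            merged.append(preferred if preferred is not None else other)
        groups = merged
        level += 1
    winner = groups[0]
    return [1 if i == winner else 0 for i in range(n)]
```

With all eight requests set, a different unit wins at each of eight successive clocks, so every unit receives top priority once before any unit receives it twice.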

We now consider the case where the number of requesting units is not a
power of 2. We start with the design just described for the smallest power
of 2 greater than the number we are considering. By appropriately allocating
the requesting units to the excess available slots, by sequencing the
entire unit correctly, and by allowing some requesting units to use either of
two slots, we can meet the design constraints. We will present an informal
constructive proof.

First, we precisely restate the problem. We have N units. We must
construct a circuit which, when presented with N bits, any subset of which may
be set, will select a single bit. It must perform this selection on a
priority basis. These priorities must rotate in such a way that any bit will
go through all possible priorities from 1 to N before any priority is
repeated. We have already demonstrated how to construct such a circuit when N
is a power of 2. We will prove the more general case by induction. It is clear
we can construct such a circuit for N = 1. Now we assume we can construct
the circuit for all integers less than or equal to M, the greatest power
of 2 that is less than N. We will use these circuits to show how to
construct a circuit for all integers less than or equal to 2M. We need to
consider two cases, N even and N odd.

First, if N is even, we use two circuits of size N/2. We then add one
additional level that chooses between the outputs of these two circuits.
By varying this selection choice every N/2 selections, we have the desired
circuit.
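The even case can be illustrated at the level of priority schedules rather than gates. In the Python sketch below, a schedule is a list of states, each giving the priority (1 = highest) held by every unit. The function `combine_even` builds a schedule for N units from one for N/2 units in the manner described above, and `is_fair_schedule` checks the requirement that every unit passes through all priorities from 1 to N before any is repeated. Both function names, the schedule representation, and the 3-unit rotation used as a starting point are assumptions made for illustration.

```python
def is_fair_schedule(schedule):
    """Check that each state ranks the units 1..N without ties and that
    each unit holds every priority exactly once over the N states."""
    n = len(schedule)
    expected = set(range(1, n + 1))
    if any(len(row) != n or set(row) != expected for row in schedule):
        return False
    return all({row[i] for row in schedule} == expected for i in range(n))


def combine_even(half):
    """Build a schedule for N = 2 * len(half) units from one for N/2 units.

    Two copies of the half-size circuit, A and B, are sequenced alike.
    The added level prefers A during the first N/2 states and B during
    the last N/2 states, so the preferred circuit's units hold
    priorities 1..N/2 and the other circuit's units N/2+1..N.
    """
    h = len(half)
    a_preferred = [list(row) + [p + h for p in row] for row in half]
    b_preferred = [[p + h for p in row] + list(row) for row in half]
    return a_preferred + b_preferred


# Hypothetical fair schedule for 3 units (a plain rotation).
half = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
full = combine_even(half)
```

Here `full` is a six-state schedule for six units, and `is_fair_schedule(full)` holds whenever the half-size schedule is itself fair.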

The case where N is odd is somewhat more complex. Let K be such that
N = 2K + 1. We begin with two circuits of size K + 1. Again, we will add
an additional level to select between these. We will assign K of the
inputs to circuit A and K + 1 to circuit B. In addition, we install binary
switches to allow any of the K + 1 inputs of B to use the vacant input of
A. We will need to assume that a circuit of size L + 1, with L even, can be
used with L inputs. This is clearly true for L = 1. It will be true for
larger L because of the way these circuits are constructed out of smaller
circuits as we have outlined above. In particular, the circuit of size
L + 1 is made up of two circuits of size L/2 + 1, with a global level for
selecting between them. Clearly, if these smaller circuits can be made to
work for size L/2, then we can sequence the larger circuit in a way that it
will work for size L. Thus, given the way we are constructing our circuit
of size N, we may assume the circuits of size K + 1 can work for K inputs.
We now proceed with the construction of our size N circuit.

For the first K states of the entire device, we sequence A and B in
any way that ensures no input will have its priority repeated. We do this
by giving B highest priority at the highest added level of the circuit and
by sequencing A and B individually so they do not repeat any states.
For the remaining K + 1 states, we must give priority to A and, in
addition, during each of these states, switch one of the B inputs into the
vacant spot of A. We need to pay special attention to the element assigned
priority K + 1 during each of these remaining states. All but one (we will
call it Z) of the elements of B had that priority during the first K states.
We will sequence A as if it were an ordinary K + 1 state device during these
remaining states. Thus, during the state in which the vacant input of A has
priority K + 1, we must switch Z into the vacant spot. During each of the
other K final states, we must switch a different element of B into A. The
element that must be switched during each of these other states is also
uniquely determined. Whatever priority the vacant element receives will
correspond to a priority already assigned to K of the elements of B. Thus,
the unique remaining element must always be switched. Thus, to complete
the proof, we must show how we can do this switching and still ensure that
none of the last priorities will be repeated. The algorithm for this is
quite simple. We begin with any correct K + 1 sequence for B. We choose
row X from this sequence as the one to be switched. In other words, we will
always switch the element of B that would have been assigned priority X
within B. This assures us that none of the last K priorities will be
repeated. We note that we can arbitrarily permute the sequence in which
these states occur. Thus, we permute them in such a way as to ensure that
the element of B having priority X is the unique element that must be
switched during each of the last K + 1 states. This completes the
construction. Figure 14 gives an example for N = 7 and also provides a count
of the number of switches for arbitrary N. Table 11 provides a summary gate
count for the VEU.
Gate Count Summary

First we consider N a power of 2.

Gate count for basic selection logic is 8.

Thus for N = 2^K we have

\[
\frac{N}{2}(8+2) + \frac{N}{4}(8+4) + \frac{N}{8}(8+8) + \cdots + (8+2^K)
  = 8(N-1) + N \cdot K
\]

For N not a power of 2, the gate count is less than the gate count for M, the
smallest power of 2 greater than N, plus twice the gate count for N-1 switches
(2^K = M). Thus the total is:

\[
< 8(M-1) + M \cdot K + (N-1)\,8 = 8(N+M-2) + M \cdot K
\]


FIGURE 14   NON-POWER-OF-2 PRIORITY SELECTOR (cont.)
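The gate count summary can be checked numerically. The Python sketch below sums the level-by-level cost for N a power of 2, taking level j to contain N/2^j devices of 8 + 2^j gates each, and compares it with the closed form 8(N-1) + N K. It also evaluates the bound for N not a power of 2, assuming M is the smallest power of 2 greater than N as in the construction. The function names are hypothetical, and the reading of the summary's formulas is our own.

```python
def log2_exact(n):
    """Return K such that n == 2**K, or raise if n is not a power of two."""
    k = n.bit_length() - 1
    if n < 1 or n != 1 << k:
        raise ValueError("n must be a power of two")
    return k


def gates_by_level(n):
    """Sum the device costs level by level for n = 2**K requesting units."""
    k = log2_exact(n)
    # Level j has n / 2**j devices, each costing 8 + 2**j gates.
    return sum((n >> j) * (8 + (1 << j)) for j in range(1, k + 1))


def gates_closed_form(n):
    """Closed form 8(N-1) + N*K for n = 2**K."""
    return 8 * (n - 1) + n * log2_exact(n)


def gate_bound_non_power_of_two(n):
    """Bound 8(N+M-2) + M*K, with M = 2**K the smallest power of 2 above n."""
    k = n.bit_length()
    m = 1 << k
    return 8 * (n + m - 2) + m * k
```

For N = 8 both forms give 80 gates, and for N = 7 the bound evaluates to 128.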
TABLE 11   VEU GATE COUNT

(Gate counts for the units in Figure 10)

Unit                                Source of Gate Estimate       Gate Count

Operand Buffer                      8x16x64 bits +                  40 000
                                    addressing logic
Result Buffer                       8x16x64 bits +                  40 000
                                    addressing logic
Instruction Queue                   Table 4                          5 600
Internal Switch Queue               Table 4                          5 600
Operand Buffer Access Controller    Figure 14                          300
Result Buffer Access Controller     Figure 14                          300
Sequencer                           Estimate based on function       1 000
Internal Switch Controller          Estimate based on function       1 000
Internal Switch                     64 bit words                       256

TOTAL                                                               54 056
4.4.3   Vector Buffer

The Vector Buffer serves two purposes. It provides a source of
operands that can be used by multiple instructions without accessing main
memory. In addition, it provides space where intermediate results can be
stored. These values are also stored within the VEUs, but the number allowed
in a single VEU is quite small, probably 16. The detailed allocation of Vector
Buffer storage is handled by the IUD. In this section we will provide a
general functional discussion of this storage allocation and provide a
detailed design of the Vector Buffer itself.

In Section 2 where we described OFFL, we noted that all instructions
which perform operations have addresses referring to an intermediate buffer.
All loads and stores to main memory are to locations in this virtual buffer.
The physical buffer corresponding to this virtual buffer is distributed
within the VEUs and the Vector Buffer. These virtual locations can be divided
into two classes: those that were initially defined by an instruction to
load from memory, and those that were defined as the result of some
operation. All of the first class are assigned space in the Vector Buffer. All
in the second class are initially assigned space within the VEU that is
assigned the corresponding instruction. Elements from the second class will
be transferred to the Vector Buffer if that is necessary to keep the storage
space within the VEU from being exhausted.
We will refer to the Vector Buffer plus the storage space within the
VEUs for results as the total vector buffer. Once a physical address
within the total vector buffer has been allocated, it must remain allocated
until the corresponding virtual address is reused. Once the virtual address
is reused and all instructions with pending requests for the corresponding
physical address have completed accessing this physical address, it may be
reused. The contents of that physical location are no longer accessible by
any executing program. It would be possible to keep an associative memory
that relates such buffer locations to main memory and thus in some instances
possibly save some memory accesses. We do not include this option as part
of our design, because it does not appear to us to provide much of a
return for the logic that would be required, given our overall structure.

Clearly,  it  is  essential  that  the  number  of  available  virtual  addresses 
not  exceed  the  number  of  physical  locations  within  the  total  vector  buffer. 
Given  the  highly  pipelined  nature  of  the  machine  and  the  inevitable  delays 
between  the  time  when  a  virtual  address  is  reused  and  the  time  that  all 
pending  instructions  have  completed  access  to  the  corresponding  physical 
address,  we  require  an  excess  of  physical  locations  to  logical  or  virtual 
locations.  We  will  first  discuss  the  number  of  virtual  locations  likely  to 
be  desirable  and,  on  this  basis,  estimate  a  reasonable  number  of  physical 
locations. 

In determining the virtual buffer size, we will concentrate on the
pipelining delay between memory and the VEUs. Considering other aspects
which are program dependent makes the problem extremely complex. Further,
our queued and pipelined structure is intended to ameliorate such problems
across a broad spectrum of programs. Thus, it is reasonable to concentrate
our attention on the buffer size required to keep the pipe flowing. The
essential constraint in determining this will be the time for a transfer
from a VEU to primary memory and back to the VEU. We need enough virtual
memory space to ensure that a memory value that is reused within this
time interval can be left in virtual storage. This leads us to the
observation that the size of the virtual buffer is primarily dependent on the
rate of reuse of memory locations within the specified delay time. This
time cannot be computed exactly, but we can provide a rough conservative
estimate. The overall delay is a sum of the following delays:

1.  Delay  in  the  Vector  Switch  queue     (4)    4.4.4 

2.  Delay  in  the  Vector  Switch         (8)    4.4.4 

3.  Delay  in  the  memory  buffer  queue    (4)    4.5 

4.  Delay  in  the  memory  switch        (11)    4.5 

5.  Delay  in  the  memory  page  buffer     (4)    4.5 

6.  Delay  in  memory  store  (8)    4.5 

The  number  in  parentheses  is  the  delay  in  minor  clocks.  The  second  number 
is  the  section  in  which  the  unit  is  discussed  in  detail. 

The total delay is twice the sum of the individual delays plus an
additional trip through the Vector Switch, or 82 minor clocks. One VEU can
generate 10 results in this time (one every 8 minor clocks), so our 6
VEUs can generate roughly 60 results. Thus, 64 would be a reasonable
conservative size for the virtual vector buffer. This estimate would be
adequate for a 100 percent reuse of memory values within the specified delay.
Since the number of locations required for this assumption is reasonably
small and this case may be approximated over some program segments, it is
reasonable to allow for this worst case.
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We  now  turn  our  attention  to  the  physical  buffer  size  required  to 
achieve  the  specified  virtual  buffer  size.  The  primary  factor  we  have  to 
consider  here  is  the  delay  between  the  time  a  virtual  memory  location  is 
reused  in  the  instruction  stream  and  the  time  the  corresponding  physical 
location  can  be  reallocated.  The  start  and  end  of  this  delay  refers  to  the 
IUD.  More  specifically,  it  is  the  time  beginning  when  the  IUD  notes  that 
an  instruction  reuses  allocated  virtual  memory  address  and  the  time  the  IUD 
is  able  to  reallocate  that  address.  The  total  time  for  this  process  is  a 
function  not  only  of  the  various  pipe  and  queue  delays,  but  also  a  function 
of  the  total  number  of  pending  requests  to  access  the  virtual  memory  loca- 
tion when  it  is  reused.  Instead  of  explicitly  considering  this  case,  we 
will  consider  a  particular  case  for  which  the  delays  are  relatively  easy  to 
estimate  and  which  should  in  most  instances  be  the  worst  case. 
We  will  assume  the  following  OFFL  instruction  sequence: 

    LOAD A to T1
    LOAD B to T2
    COMPUTE T3 from A and B
    Instruction which uses T3
    Instruction which reallocates virtual location T3

In addition, we assume A and B are in the same memory page. Because accesses
to virtual memory locations are buffered within each VEU, it is unlikely
that these accesses will be delayed by a greater time than that required to
fetch a single operand from memory. We now estimate the delays encountered
by the above sequence. Again we give the time in minor clocks and the section
in which the unit performing the function is described in detail.

 1.  IUD delay to complete processing memory instructions    (8)   4.6
 2.  Delay in switching instruction into memory page queue   (5)   4.5
 3.  Delay in memory page queue                              (4)   4.5
 4.  Delay in accessing memory                               (8)   4.5
 5.  Delay in memory switch                                 (11)   4.5
 6.  Delay in Vector Switch queue                            (4)   4.4.4
 7.  Delay in Vector Switch                                  (8)   4.4.4
 8.  Delay in Vector Switch queue                            (4)   4.4.4
 9.  Delay in Vector Switch                                  (8)   4.4.4
10.  Additional delay to access the second operand           (8)
11.  Delay in VEU queue                                     (32)   4.4.2.4
12.  Computation time                                        (8)   4.4.2.4
13.  Delay in Vector Switch queue                            (4)   4.4.4
14.  Delay in Vector Switch                                  (8)   4.4.4
15.  Time to transmit information about available
     virtual location to VIDS                                (8)   4.6.6
16.  Time to transmit information to IUD                     (8)   4.6.3

These delays total 136 minor clocks or 17 major clocks. During this
period our 6 VEUs could generate up to 102 new results, each of which might
require a new physical buffer location. Adding this figure to our earlier
estimate of 64 different virtual addresses gives 166 locations, so a buffer
size of 256 seems a reasonable size and leaves a substantial margin for
error. The VEUs will contain 96 of these locations as their result buffers
and the remainder will be within the Vector Buffer. The internal design of
the Vector Buffer will be functionally the same as the data buffer described
in Section 4.4.2.4.
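
The arithmetic behind both buffer estimates can be checked mechanically.
The following Python sketch is illustrative only: the delay values and the
sizes 64, 256, and 96 are taken from the text, while the variable and
function names are hypothetical, and rounding down to whole results is an
assumption about how the figures of 10 and 17 results per VEU were reached.

```python
MINOR_PER_MAJOR = 8   # a VEU produces one result every 8 minor clocks
NUM_VEUS = 6

# Round-trip delay VEU -> memory -> VEU, as quoted in the text.
round_trip_delay = 82

# The sixteen reallocation delays listed above, in minor clocks.
reallocation_delays = [8, 5, 4, 8, 11, 4, 8, 4, 8, 8, 32, 8, 4, 8, 8, 8]


def results_generated(delay_minor, veus=NUM_VEUS):
    """Whole results that all VEUs together can produce within the delay."""
    return veus * (delay_minor // MINOR_PER_MAJOR)


virtual_results = results_generated(round_trip_delay)    # 6 * 10
virtual_size = 64                                         # size chosen in the text

realloc_total = sum(reallocation_delays)                  # minor clocks
realloc_major = realloc_total // MINOR_PER_MAJOR          # major clocks
new_results = results_generated(realloc_total)

physical_needed = virtual_size + new_results
physical_size = 256                                       # size chosen in the text
veu_result_slots = 96                                     # held inside the VEUs
vector_buffer_slots = physical_size - veu_result_slots

print(virtual_results, realloc_total, realloc_major, new_results)
print(physical_needed, vector_buffer_slots)
```

Under these assumptions the chosen size of 256 exceeds the computed
requirement of 166 locations by 90, which is the margin the text refers to.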
4.4.4   Vector Switch

The  design  of  the  Vector  Switch  requires  that  one  solve  two  basic 
problems.  First  of  all,  one  must  determine  the  number  of  ports  to  and  from 
the  various  units.  Secondly,  there  is  the  problem  of  the  internal  structure 
of  the  switch.  We  begin  with  a  discussion  of  the  ports. 

We will assume a machine with four binary VEUs and two unary VEUs.
This could correspond to two routers and four vector/tree arithmetic units.
This will require eight ports going to the binary VEUs and four ports coming
from them. The unary units require two input ports and two output ports. A
single port is required going to the Scalar Buffer. The remaining units
requiring ports are the Vector Buffer and primary memory. The optimal
number of paths to these units depends on the ratio of primary memory
references to total operand references. This figure varies across programs
and within an individual program. In designing the ports to the Vector
Buffer, we will assume two-thirds of all instructions access buffer
locations already available. In designing the main memory ports, we will
assume two-thirds of all instructions require a memory access. These
assumptions should assure us that even in the worst cases the capacity of
the Vector Switch will not slow the machine by more than a factor of
one-third. Experimentation with an existing machine would undoubtedly
provide the data for determining more cost-effective distributions of
ports. In the case of memory ports, our assumptions lead to a requirement
of eight ports coming from memory and four ports going to memory. In the
case of the Vector Buffer, things are a bit more complex. All operands that
originate in Primary Memory are stored in the Vector Buffer. Those operands
that were computed by earlier instructions may be in the VEU which computed
them. Providing eight input and output ports for the Vector Buffer should
roughly conform to our assumption. Table 12 summarizes these conclusions.

TABLE 12   VECTOR SWITCH PORTS

                   Input Ports    Output Ports
Unit               (to unit)      (from unit)

2 Unary VEUs            2              2
4 Binary VEUs           8              4
Scalar Buffer           1              0
Memory                  4              8
Vector Buffer           8              8

TOTAL                  23             22
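
As a cross-check on Table 12, the VEU port counts follow directly from
operand arity, and the totals can be tallied. The sketch below is an
illustration in Python; the Scalar Buffer, Memory, and Vector Buffer counts
are simply copied from the text, and all names are hypothetical.

```python
def veu_ports(units, operands_per_unit):
    """Return (input ports, output ports): one port per operand, one per result."""
    return units * operands_per_unit, units


# Each value is (ports going to the unit, ports coming from the unit).
ports = {
    "unary VEUs": veu_ports(2, 1),
    "binary VEUs": veu_ports(4, 2),
    "Scalar Buffer": (1, 0),    # as stated in the text
    "Memory": (4, 8),           # four ports to memory, eight from memory
    "Vector Buffer": (8, 8),
}

total_in = sum(i for i, _ in ports.values())
total_out = sum(o for _, o in ports.values())
print(total_in, total_out)
```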

We  now  turn  our  attention  to  the  internal  structure  of  the  Vector 
Switch.  It  is  a  pipelined  crossbar  switch  with  queued  instructions  asso- 
ciated with  each  of  its  entry  ports.  Once  a  path  in  the  switch  has  been 
reserved,  it  will  remain  active  for  8  minor  clocks  and  allow  the  transfer  of 
an  8-word  vector.  Thus,  there  is  a  fairly  long  time  available  for  searching 
the  queues.  This  is  important  because  requests  to  use  the  Vector  Switch  may 
be  made  long  before  the  operand  is  available.  Thus,  in  searching  its  queues, 
the  Vector  Switch  must  not  only  be  sure  that  a  path  is  available,  but  must 
also  determine  that  the  data  is  present.  The  presence  of  data  is  indicated 
by  a  single  bit  which  is  set  whenever  data  is  stored  in  any  of  the  vector 
buffer  locations.  This  bit  is  reset  whenever  the  corresponding  physical 
location  is  freed;  i.e.,  when  its  use  count  is  zero  and  the  corresponding 
logical location has been reused. Note that the algorithms for keeping track
of vector buffer storage are simpler than those for scalar buffer storage,
because  each  different  valued  vector  has  a  different  physical  address,  and 
there  is  no  need  for  time  indexes  to  keep  track  of  them.  On  the  other  hand, 
the  scalar  switch  does  not  have  to  test  for  the  presence  of  data  since  re- 
quests are  never  entered  in  its  queues  until  the  data  is  available. 

Most  of  the  logical  design  for  the  above  mentioned  functions  is  similar 
to  work  we  have  already  done.  However,  the  large  number  of  "functionally 
identical"  ports  going  to  and  from  the  vector  buffer  and  memory  does  present 
us  with  a  new  allocation  problem.  The  solution  is  to  assume  that  these  paths 
become  available  in  a  time  skewed  fashion.  There  are  in  all  cases  either  4 
or  8  paths  which  are  tied  up  for  8  minor  clocks  once  they  are  reserved. 
Further,  because  they  feed  memories  that  must  be  allocated  in  a  time  skewed 
fashion,  some  form  of  time  skewing  is  required.  Thus,  we  can  assume  that 
only  one  of  these  paths  becomes  available  in  each  minor  clock  and  the  stan- 
dard priority  hardware  from  Section  4.4.2.4  can  be  used.  The  same  priority 
unit  will  be  used  to  schedule  all  the  paths  in  any  equivalent  set.  This 
scheme  will  accommodate  the  problems  associated  with  multiple  input  ports. 

We have a related problem associated with multiple output ports. We
are searching queues to drive the ports and at least one minor clock is
required for each test of a queue entry. Thus, unless every entry tested is
ready to transfer, we cannot run at the maximum possible data rate. We solve
this problem with multiple queues. Since each path is tied up for 8 minor
clocks, a queue serving four paths need only start a transfer every other
minor clock. One queue for every four paths will therefore allow every
other entry to be unavailable and still run at maximum rate. Table 13
summarizes the hardware in the vector switch.
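
The rate argument for multiple queues can be made concrete with a short
calculation. This Python sketch is only an illustration: it assumes one
queue entry is tested per minor clock and that every reserved path is held
for exactly 8 minor clocks, as stated above; the function names are
hypothetical.

```python
import math
from fractions import Fraction

HOLD_CLOCKS = 8        # minor clocks a reserved path stays busy
TESTS_PER_CLOCK = 1    # queue entries one queue can test per minor clock
PATHS_PER_QUEUE = 4    # rule adopted in the text


def required_ready_fraction(paths_per_queue=PATHS_PER_QUEUE):
    """Fraction of tested entries that must be ready to keep all paths busy."""
    starts_per_clock = Fraction(paths_per_queue, HOLD_CLOCKS)
    return starts_per_clock / TESTS_PER_CLOCK


def queues_needed(paths):
    """Queues required for a unit under the one-queue-per-four-paths rule."""
    return math.ceil(paths / PATHS_PER_QUEUE)


print(required_ready_fraction(4))   # half of the tested entries must be ready
print(required_ready_fraction(8))   # a single queue for 8 paths tolerates no misses
print([queues_needed(p) for p in (1, 4, 8)])
```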
TABLE 13   VECTOR SWITCH HARDWARE

I.   QUEUES FOR ALL UNIT OUTPUT PORTS

                 Number   Paths/   Queues/   Queue     Gate Count/   Total
Unit             Units    Unit     Unit      Size(a)   Queue(b)      Gates

Unary VEUs         2        1        1         16        6 696       13 392
Binary VEUs        4        1        1         16        6 696       26 784
Memory             1        4        1         64       23 784       23 784
Vector Buffer      1        8        2         64       23 784       47 568

II.   CONFLICT RESOLUTION CIRCUITS FOR ALL UNIT INPUT PORTS

                 Number   Paths/   Requesting   Gate        Total
Unit             Units    Unit     Units(c)     Count(d)    Gates

Unary VEUs         2        1         8           168         336
Binary VEUs        4        2         8           168        1344
Scalar Buffer      1        1         8           168         168
Memory             1        8         8           168         168
Vector Buffer      1        8         7           160         160

III.   CROSSBAR SWITCH

Switch Size:   21 x 22 x 80 bits
Gates:         147,840

TOTAL GATES:   262,544

a.  See previous section for basis of estimates.
b.  See Section 4.3.2.2 for queue gate count formula.
c.  This is the total number of queues minus the queues for this unit.
d.  See Section 4.4.2.4 for priority logic gate counts.

4.5   MAIN MEMORY

Logically  main  memory  is  divided  into  pages.  These  pages  are  8  words 
wide.  A  reasonable  length  would  be  1  K.  Physically  each  page  is  indepen- 
dently queue  driven.  Load  and  store  systems  of  switches  connect  these  pages 
to  buffers  in  front  of  the  memory  ports  in  the  Vector  Switch.  Other  switches 
distribute  queue  entries,  modes,  and  indexes  to  the  control  portion  of  each 
page.  No  indexing  is  allowed  across  pages.  Figure  15  shows  the  overall 
structure  of  memory  and  the  load  system  of  switches. 

We now discuss the organization and operation of this system. It may
frequently happen that for a short period of time it is desirable to access
an individual memory page at the maximum rate possible for that page. On the
other hand, the number of memory ports in the Vector Switch makes it
pointless to be able to simultaneously access all pages at their maximum
possible rate. There is a virtually unlimited number of ways a memory may be
organized, considering these constraints. We have chosen one that seems
reasonable and workable. We will consider a 1 megaword memory. It will be
clear from the discussion how to generalize to other sizes. The Vector
Switch has 8 input ports for communicating with memory. (See Section 4.4.4.)
Thus, we want to design our switch to accommodate this data rate coming from
anywhere in memory. We will assume the cycle time for main memory is 8 minor
clocks. Thus, the data path leaving one memory page need only be one word
wide if it is pipelined at one transfer every minor cycle. To exactly
accommodate the Vector Switch data rate, we need to allow at most 8 pages
transferring data at any given instant.
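
The page count and the limit of 8 simultaneously active pages follow from
the figures just given. The Python sketch below is a worked illustration
under the stated parameters (8-word-wide pages of length 1 K, an 8-minor-clock
memory cycle, 8 memory ports each carrying an 8-word vector in 8 minor
clocks); the constant names are hypothetical.

```python
from fractions import Fraction

WORDS = 1 << 20            # 1 megaword memory
PAGE_WIDTH = 8             # words per page row (one vector)
PAGE_LENGTH = 1024         # rows per page ("1 K")
CYCLE_MINOR_CLOCKS = 8     # main memory cycle time
MEMORY_PORTS = 8           # Vector Switch input ports fed by memory
VECTOR_WORDS = 8           # words moved per port reservation
TRANSFER_MINOR_CLOCKS = 8  # minor clocks a reserved path stays active

words_per_page = PAGE_WIDTH * PAGE_LENGTH
pages = WORDS // words_per_page

# One page yields one row per memory cycle, i.e. one word per minor clock.
page_rate = Fraction(PAGE_WIDTH, CYCLE_MINOR_CLOCKS)
# One port accepts one vector per reservation, also one word per minor clock.
port_rate = Fraction(VECTOR_WORDS, TRANSFER_MINOR_CLOCKS)

max_active_pages = int(MEMORY_PORTS * port_rate / page_rate)
print(words_per_page, pages, max_active_pages)
```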

The  pages  are  grouped  into  blocks  of  8.  There  are  16  of  these  blocks. 
We  allow  a  maximum  simultaneous  transfer  of  up  to  8  words  from  each  of  these 
groupings.  All  transfers  are  pipelined  at  the  rate  of  one  per  minor  clock. 


A combination of crossbar switches and global control is used to referee
conflicts. Before a page is allowed to initiate a transfer into this
structure, it must have a path reserved all the way to the highest level of
the structure. This is not to say that the path must be clear at the time
the transfer begins, but only that it will become clear at each stage when
required. We will now discuss the algorithms for allocating these paths.

Since 8 minor clocks are required to complete a vector transfer, we need
only allocate our various groups of 8 paths at the rate of one per minor
clock. Up through the first level of crossbar switches, every page has its
own path. However, the outputs of these paths are ganged together so that
one output from each of the level 1 switches is an input to the same path in
the level 2 switch. Thus, allocating paths consists of determining which
memory pages may initiate transfers through the level 2 switch and
transmitting to the switches the identity of the paths available. At a given
clock, any number of paths in the level 2 switch may be available. However,
to keep our allocation algorithm to a reasonable size, we will consider that
at most one path becomes available during each minor clock. At most we
introduce brief transient delays by this restriction. There will be no loss
in assuming that a given fixed path becomes available at a given clock. In
other words, the global control attempts to allocate the level 2 paths on a
round-robin basis. If there are no outstanding requests at a given clock,
then the path assigned that time slot will remain vacant at least for the
next 8 minor clocks. We must keep the number of pages requesting a path at
a given clock to a number we can handle. This can be done by having local
controllers limit the requests from each group of 8 pages to one.
Thus, the global controller will have at most 8 requests to deal with
in any clock. The controllers at both levels will use the access controller
described in Section 4.4.2.4.

We can now describe the complete functioning of the memory in
transferring data to the Vector Switch. The numbered paragraphs correspond
to successive minor clocks.

1.  All  memory  pages  with  queue  entries  ready  to  initiate  a  memory 
access  send  a  bit  to  the  local  controller. 

2.  Each  local  controller  selects  one  of  these  pages  for  possible 
transfer  and  sends  to  the  global  controller  a  request  for  a  path 
if  it  had  any  requests. 

3.  The  global  controller  selects  one  of  the  requests  from  the  local 
controllers  to  honor  and  notifies  the  local  controller. 

4.  The  local  controllers  notify  the  winning  page. 

5.  The  transfer  begins  through  the  first  level  of  the  crossbar.  At 
clock  3  the  global  controller  also  notified  the  local  controller 
which  path(s)  to  use  in  the  crossbar. 

6.  The  transfer  from  the  lower  level  crossbar  to  the  global  crossbar 
begins. 

7.  The transfer from the global crossbar to the buffer begins. The
global crossbar works in a fundamentally different way from the
local crossbars. It is successively transferring data to different
modules in one of the buffers. Thus, with each minor clock, it
changes its configuration.
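
The two-level selection in steps 1 through 4, together with the round-robin
assignment of level 2 paths, can be sketched as a small simulation. This is
a simplified illustration in Python, not the thesis design: the four
decision clocks are collapsed into a single loop iteration, the
lowest-numbered requester always wins (the actual priority rule belongs to
the access controller of Section 4.4.2.4, which is not reproduced here), and
all names are hypothetical.

```python
LEVEL2_PATHS = 8   # paths through the level 2 (global) crossbar
HOLD = 8           # minor clocks one vector transfer occupies a path


def allocate(ready_by_clock, clocks):
    """Grant at most one level 2 path per minor clock.

    ready_by_clock maps a clock to the (group, page) pairs that become
    ready at that clock. Returns a list of (clock, group, page, path).
    """
    pending = set()
    grants = []
    for clock in range(clocks):
        pending.update(ready_by_clock.get(clock, ()))
        # Round robin: a fixed path becomes available at each clock.
        path = clock % LEVEL2_PATHS
        # Local controllers: each group forwards at most one page.
        local_choice = {}
        for group, page in sorted(pending):
            local_choice.setdefault(group, page)
        # Global controller: honour one of the forwarded requests.
        winner = min(local_choice, default=None)
        if winner is None:
            continue   # the slot stays vacant until it comes round again
        page = local_choice[winner]
        pending.discard((winner, page))
        grants.append((clock, winner, page, path))
    return grants


demo = allocate({0: [(0, 0), (0, 1), (3, 5)], 1: [(2, 7)]}, clocks=12)
for grant in demo:
    print(grant)
```

Because path `clock % LEVEL2_PATHS` is offered only once every 8 minor
clocks and a transfer holds it for 8 minor clocks, no path is ever granted
while still busy in this sketch.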

Several remarks about this process are necessary. First, the entire
unit must be pipelined so that each function is occurring at every clock.
In practice, this is not particularly difficult; it only requires some
buffering of information. The requests for transfer always come from the
pages 4 clocks prior to the time they are actually able to begin the
transfer. Thus, the only loss from the decision delays occurs when a new
entry arrives in the intervening 4 clocks. The switches and memories may all
operate all the time and at the maximum data rate the Vector Switch allows.
With the transfer of the data, a queue entry is also transferred. This
queue entry will be used to request use of the Vector Switch. The data
paths must be slightly larger than one word to accommodate the queue entry,
which would probably need to be divided into 8 parts.

A  request  to  store  data  always  takes  precedence  over  a  request  to  load 
data.  To  initiate  a  store,  a  path  must  be  reserved  through  a  switching 
network  similar  to  the  one  just  discussed.  In  addition,  it  must  be  verified 
that  there  is  room  in  the  store  buffer  of  the  destination  page.  A  buffer 
for  stores  within  each  page  is  desirable  because  of  possible  sequencing 
problems  which  we  will  discuss  later  in  this  section  when  we  describe  the 
internal  operation  of  a  single  memory  page.  Since  the  Vector  Switch  only 
has  four  output  ports  to  memory,  allowing  one  store  to  be  initiated  in  each 
minor  clock  allows  for  transients  of  twice  the  maximum  long-term  data  rate. 
Thus,  we  will  provide  a  switching  network  to  allow  the  transfer  of  one  store 
request  per  minor  clock  to  the  specified  page.  If  that  page  has  space  avail- 
able, it  sends  back  to  the  requesting  unit  a  signal  to  proceed.  This  sequence 
is  pipelined  and  requires  two  minor  clocks.  For  stores,  there  can  be  no  con- 
flicts like  those  encountered  with  loads.  Thus,  we  simply  use  the  next  slot 
in  the  highest  level  crossbar  and  ask  to  set  up  the  level  1  crossbar  that 
services  our  destination  page. 

In addition to the load and store switch paths just discussed, we need
an instruction switch to transfer load queue entries to the appropriate page.
Since only one instruction is required for each vector load, this network
can be similar to, but less complex than, the load and store switching
networks. This same switch can be used for the transfer of scalar indexes
and scalars used for mode control.

We  now  come  to  the  internal  structure  of  the  individual  pages.  Figure 
16  shows  this  structure.  Entries  from  the  instruction  switch  may  be  either 
load  queue  entries,  modes,  or  scalar  indexes  and  are  switched  either  into 
a  scalar  buffer  or  into  an  instruction  queue.  Entries  may  arrive  from  this 
switch  at  the  rate  of  one  per  minor  clock.  Since  a  vector  access  can  only 
occur  once  every  major  clock,  this  will  be  a  more  than  adequate  data  rate. 
Vectors  arriving  from  the  store  switch  may  be  transferred  either  to  the  vec- 
tor index  buffer  or  to  the  store  buffer,  depending  on  their  intended  use. 
The  store  buffer  allows  for  the  load  switch  and  store  switch  to  be  transfer- 
ring data  with  the  same  memory  page  at  the  same  time.  The  load  buffer  allows 
memory  to  be  synchronized  with  the  load  switch.  The  control  processes  the 
queued  instructions  and  referees  possible  conflicts  between  the  load  and 
store  switches. 

We  will  now  outline  the  operation  of  the  memory  page  control.  The  queue 
which  contains  both  load  and  store  instructions  is  continuously  interrogated 
to  see  if  an  instruction  can  proceed.  The  conditions  which  must  be  met  are 
as  follows: 

1.  All  required  indexes  and  modes  are  present. 

2.  No  earlier  instruction  which  conflicts  with  this  instruction  is 
still  in  the  queues. 

3.  In the case of stores, the required data is present.

FIGURE 16   INTERNAL STRUCTURE OF MEMORY PAGE

Condition 2 requires further explanation. Clearly, no load or store
can  proceed  if  there  is  an  earlier  store  with  unknown  indexing  and  unknown 
or  overlapping  modes.  Similarly,  no  store  can  proceed  if  there  is  an 
earlier  load  with  unknown  indexes  and  unknown  or  overlapping  modes.  These 
are  the  weakest  possible  conditions  for  the  existence  of  conflicts,  and  it 
would  be  possible  to  test  for  these  specific  conditions  among  the  first 
few  queue  entries.  We  will  base  our  gate  estimates  on  this  capability, 
although  a  somewhat  weaker  condition  might  prove  more  practical. 
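
The three conditions and the conflict rule can be written as a short
predicate. The Python sketch below is a hedged illustration rather than the
thesis logic: it assumes a mode can be modelled as the set of vector
elements it selects, that an unknown index or mode is represented by `None`,
and that two accesses with known, different indexes never conflict. The
class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass
class Entry:
    is_store: bool
    index: Optional[int]              # None: index not yet known
    mode: Optional[FrozenSet[int]]    # None: mode not yet known
    data_present: bool = True         # only meaningful for stores


def modes_may_overlap(a: Entry, b: Entry) -> bool:
    """Unknown modes are treated as possibly overlapping."""
    return a.mode is None or b.mode is None or bool(a.mode & b.mode)


def conflicts(earlier: Entry, later: Entry) -> bool:
    """Two loads never conflict; otherwise test index and mode."""
    if not (earlier.is_store or later.is_store):
        return False
    same_or_unknown = (
        earlier.index is None
        or later.index is None
        or earlier.index == later.index
    )
    return same_or_unknown and modes_may_overlap(earlier, later)


def can_proceed(queue: List[Entry], i: int) -> bool:
    entry = queue[i]
    if entry.index is None or entry.mode is None:       # condition 1
        return False
    if entry.is_store and not entry.data_present:       # condition 3
        return False
    return not any(conflicts(e, entry) for e in queue[:i])   # condition 2


lanes = frozenset
blocked = [Entry(True, None, None), Entry(False, 5, lanes({0, 1}))]
print(can_proceed(blocked, 1))   # a load behind a store with unknown indexing
```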

Space  in  the  indexing  buffers  is  reserved  by  the  IUD  in  the  same  manner 
as  space  within  the  Vector  Buffer.  Thus,  these  buffers  have  to  notify  the 
IUD  when  the  values  they  contain  have  been  used  and  the  space  is  free.  One 
can  estimate  sizes  for  these  buffers  by  an  analysis  like  that  in  Section 
4.4.3.  Determining  a  size  for  the  store  buffer  is  more  complex  since  it  is 
dependent  on  how  much  instruction  reordering  is  done  by  the  queue  and  con- 
trol. We  will  estimate  a  size  of  8  as  being  reasonably  small  and  probably 
larger  than  will  usually  be  needed.  Table  14  provides  a  summary  of  all  the 
hardware  described  in  this  section. 

There are two capabilities that we do not provide in this design that
might be of considerable practical value. One would be the ability to
provide index arithmetic within each memory page. Loops might often involve
performing simple operations on the same base index set and within the same
memory page. The second capability is to provide memory-to-memory and
memory-to-index register transfers without going through the Vector Buffer
and Vector Switch. Both of these capabilities could be provided without
major increases in the logical complexity of the system and should
undoubtedly be considered.

TABLE 14   MEMORY LOGIC SUMMARY

Unit                   Source of Gate Estimate         Gate Count

First we list the gate counts for the units in a memory page (Figure 16).

Instruction Queue      Table 4                              5 500
Control                Estimate based on function           1 000
Switch                 1x2 64 bits                            384
Scalar Buffer          16 words by 10 bits                    640
Vector Index Buffer    8x8 words by 10 bits                 2 560
Store Buffer           8 words by 64 bits                   2 048
Load Buffer            8 words by 64 bits                   2 048

Total for memory page excluding memory:                    14 664

128 pages are required for a million words:             1 876 992

Now we compute accessing network gate counts as illustrated by Figure 15.

8x8 Switch             72 bit words                        18 432
Local Controller       Estimate based on function     [illegible]

Load Network has 17 switches and 16 controllers:          345 344
Store Network has one controller and 17 switches:   [illegible] 344
I/O Network has 16 switches and one controller:           296 912

Total Exclusive of Memory:              [illegible],843,[illegible]

4.6   INSTRUCTION UNIT DISPATCHER


4.6.1   Introduction 


The  Instruction  Unit  Dispatcher  (IUD)  has  the  responsibility  of 
mapping  OFFL  instructions  from  up  to  four  MIDs  into  some  collection  of 
execution  units.  It  must  ensure  that  the  correct  operands  for  an  instruc- 
tion will  meet  in  the  unit  assigned  that  instruction.  The  principal  problem 
in  designing  this  unit  is  maintaining  a  high  instruction  rate  while  provid- 
ing an  "intelligent"  scheduling  algorithm.  The  scheduling  algorithm  must, 
as  a  minimum,  assure  that  no  blockages  result  and  maintain  the  correct 
logical  sequence  of  operations. 

In  describing  the  IUD,  we  will  first  outline  its  functional  structure, 
ignoring  all  problems  associated  with  maintaining  the  necessary  high  data 
rate.  We  will  then  determine  what  degree  of  pipelining  and  parallelism 
will  be  necessary.  We  will  discuss  in  more  detail  the  various  operations 
that  the  IUD  performs.  In  this  discussion  we  will  bring  in  any  algorithm 
modifications  necessitated  by  the  combination  pipeline  parallel  processing 
required.  We  will  then  provide  a  detailed  logical  design  of  the  IUD,  com- 
plete with  gate  counts. 

4.6.2   IUD  Functional  Structure 

The  IUD's  operation  is  partitioned  into  several  tasks.  Three  broad 
categories  are:  work  on  operands,  work  on  results,  and  construction  of 
queue  entries.  The  three  types  of  operands  are  vectors,  scalars,  and  main 
memory  vectors.  For  main  memory  vectors,  the  IUD  merely  passes  on  the 
specified  address  to  the  correct  memory  box.  For  scalar  instructions,  a 
time index is necessary to uniquely identify the operand. An associative
memory table is accessed to obtain this time index. The use made of this
time  index  is  discussed  in  Section  4.3.  A  logical  vector  operand  must  be 
mapped  onto  the  correct  physical  vector  register.  Another  associative 
memory  is  provided  for  this  function. 

Scalar  results  are  used  to  update  the  scalar  status  table  mentioned 
above.  Similarly,  vector  results  are  used  to  update  the  vector  status  table 
which  maps  physical  to  logical  registers. 

Both  operands  and  results  as  well  as  the  operation  fields  are  used  by 
the  IUD  in  generating  various  queue  entries.  Where  execution  units  are  not 
unique,  the  IUD  must  decide  which  to  use.  In  the  case  of  vector  instruc- 
tions, it  must  reserve  space  in  the  VEU  and  set  up  queue  entries  to  route 
data  as  required.  Finally,  it  must  set  up  the  queue  entry  for  the  execution 
of  the  operation  itself. 

4.6.2.1   Data  Rate  Analysis 

The instruction rate the IUD must handle is a function of the processing
rate of the various execution units. A reasonable average is one
operation per unit per major clock. For this computation, we will assume all
operands originate in memory and are returned to memory. This is an
extremely conservative assumption. It will be somewhat balanced by
neglecting memory-to-memory instructions and transfers between scalar memory
and vector memory. The overall assumption is still somewhat conservative.

Each vector operation counts as 4 instructions for a binary operation
(3 memory instructions plus the actual operation) and 3 instructions for a
unary operation. Each scalar operation counts as one instruction.
Reasonable values are 4 vector binary units plus 2 vector scalar units.
In addition, 6 scalar units is a likely value. Thus, we should be able to
process roughly 28 instructions through the IUD in one major clock. This
comes to 4 instructions per minor clock with full pipelining (28/8 = 3.5,
rounded up).
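
The instruction-rate estimate can be reproduced in a few lines. In this
Python sketch the two non-binary vector units are counted at the unary cost
of 3 instructions per operation; that reading is an assumption, chosen
because it reproduces the total of 28 given in the text. The dictionary and
variable names are hypothetical.

```python
import math

MINOR_PER_MAJOR = 8

# unit type -> (number of units, IUD instructions per operation)
unit_mix = {
    "vector binary": (4, 4),   # 3 memory instructions plus the operation
    "vector unary": (2, 3),    # 2 memory instructions plus the operation (assumed)
    "scalar": (6, 1),
}

# Each unit is assumed to complete one operation per major clock.
per_major = sum(n * cost for n, cost in unit_mix.values())
per_minor = math.ceil(per_major / MINOR_PER_MAJOR)
print(per_major, per_minor)
```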

4.6.2.2  Memory  Operands  and  Results 

Sequencing of instructions referring to memory is controlled by the
individual memory boxes. The instructions need only be passed on to the
appropriate memory box in the correct sequence. The IUD need not perform
additional processing on these operands.

4.6.2.3  Scalar  Operands  and  Results 

In  order  to  ensure  proper  sequencing  of  scalar  instructions,  operands 
must  have  both  a  time  and  place  index.  These  designate  a  particular  loca- 
tion in  the  scalar  buffer  and  a  particular  "time  index"  which  uniquely 
identifies  a  store  to  that  location.  To  ensure  that  no  operand  will  be 
over-written  when  it  is  still  required  by  a  queued  instruction,  the  SEU 
must  be  provided  with  a  count  of  the  number  of  pending  requests  for  a  given 
instruction  result.  The  scalar  status  table  provides  the  information  neces- 
sary to  construct  the  time  index  and  generate  the  operand  use  count.  The 
range  of  the  time  index  was  discussed  in  Section  4.3.  The  values  128  and 
256  were  determined  as  reasonable  options  in  that  section. 

The  status  table  is  an  associative  memory  containing  entries  for  all 
recent  stores  to  scalar  memory  that  may  be  ambiguous.  It  contains  a  time 
index  and  scalar  memory  location  for  each  entry.  In  addition,  it  contains 
a  disable  bit  and  a  bit  to  indicate  if  the  address  refers  to  a  vector  stored 
across  the  scalar  memory.  Whenever  an  instruction  with  result  to  scalar 
memory is processed and there is another store to that location in the
queues, a new entry is made in an available location in the status table.
In  addition,  a  previous  entry  using  that  same  location  has  its  disable  bit 
set  and  its  location  recorded  in  another  table  as  available. 

We  still  have  the  problem  that  some  of  the  operands  may  refer  to  re- 
sults being  processed  in  parallel  with  these  operands.  We  take  care  of  this 
case  by  special  circuitry  containing  the  scalar  results  being  processed. 
This  special  circuitry  finds  the  entry  with  the  latest  time  index  earlier 
than  the  time  index  for  a  particular  operand.  This  requires  a  comparison 
tree,  but  since  only  a  small  number  of  results  are  processed  in  parallel, 
this  tree  is  quite  small.  Additional  circuitry  selects  the  time  index  from 
either  this  comparison  tree  or  from  the  full  table  search  when  the  tree 
finds  no  match.  The  full  table  update  for  the  results  being  processed  occurs 
in  the  same  clock.  The  search  for  current  operands  will  not  see  these  en- 
tries until  one  clock  later. 

Finally,  the  scalar  status  table  must  be  kept  from  becoming  full. 
Thus,  we  want  to  remove  entries  from  it  as  quickly  as  possible.  As  soon  as 
any  instruction  which  causes  an  entry  to  be  made  in  this  table  has  completed, 
the  associated  entry  in  the  table  can  be  freed.  Thus,  there  is  an  additional 
bookkeeping  table  containing  instruction  indices  and  associated  scalar  status 
table locations. When notification comes that a scalar instruction has
completed, this table is used to determine if a scalar status table location
may be freed. Since a scalar status table location may be freed before the
associated instruction is complete, this bookkeeping table needs to be
updated whenever this occurs.
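
The bookkeeping described above can be sketched in Python. This is only a behavioral illustration under stated assumptions: the class and method names are hypothetical, the associative search is modeled as a linear scan, setting the disable bit is modeled as releasing the slot at once, and the test for another store to the same location being in the queues is omitted.

```python
class ScalarStatusTable:
    """Behavioral sketch of the scalar status table (names are hypothetical)."""

    def __init__(self, size):
        self.entries = {}               # slot -> (scalar location, time index)
        self.free = list(range(size))   # slots recorded as available
        self.by_instr = {}              # bookkeeping: instruction index -> slot

    def _release(self, slot):
        # Free a slot and drop any bookkeeping entry that still points at it.
        del self.entries[slot]
        self.free.append(slot)
        for instr, s in list(self.by_instr.items()):
            if s == slot:
                del self.by_instr[instr]

    def record_store(self, location, time_index, instr_index):
        # A newer store to the same location supersedes the older entry.
        for slot, (loc, _) in list(self.entries.items()):
            if loc == location:
                self._release(slot)
        slot = self.free.pop(0)         # raises IndexError if the table is full
        self.entries[slot] = (location, time_index)
        self.by_instr[instr_index] = slot
        return slot

    def lookup(self, location):
        # Return the time index of the live store to this location, if any.
        for loc, time_index in self.entries.values():
            if loc == location:
                return time_index
        return None

    def complete(self, instr_index):
        # Called on notification that a scalar instruction has completed.
        slot = self.by_instr.pop(instr_index, None)
        if slot is not None:
            self._release(slot)
```

Because an entry may be released before its instruction completes, `complete` has to tolerate an instruction index that is no longer present in the bookkeeping table.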

4.6.2.4   Vector  Operands  and  Results 

In the case of the vector buffer, the status table must map every
physical register in use onto the corresponding logical register. The
number of physical vector registers is much smaller than the number of
physical scalar registers. Thus, 128 or 256 is a reasonable size for this
table, corresponding to the size of the vector buffer discussed in Section
4.4.3. Like the SEU, the VEU must be provided with a count of the number of
accesses to a particular set of data values. The vector status table must
contain two pieces of information to perform the functions described above:
the physical location of a logical register and the logical register
identification.

We  will  first  discuss  the  use  of  this  table,  ignoring  the  fact  that 
more  than  one  instruction  is  being  processed  in  parallel.  When  a  vector 
operand  is  encountered,  this  table  is  accessed  as  an  associative  memory  to 
find  the  physical  register  corresponding  to  the  designated  logical  register. 
If  there  is  no  corresponding  entry,  this  is  an  error  condition  which  should 
cause  a  program  interrupt.  Since  the  vector  buffer  is  distributed  among 
the  VEUs  as  well  as  being  in  a  central  vector  buffer,  the  physical  location 
identifies  the  unit  as  well  as  the  location  within  the  unit.  This  informa- 
tion will  be  used  in  selecting  the  VEU  to  use  when  there  is  more  than  one 
which  may  be  used. 

A  vector  result  causes  the  associated  identification  of  that  logical 
register  to  be  altered  to  correspond  to  the  new  physical  register.  In 
addition,  it  causes  a  signal  to  be  sent  to  the  unit  containing  the  register, 
indicating  that  the  physical  register  may  be  freed  once  all  pending  requests 
on  it  have  cleared.  In  turn,  when  all  requests  have  cleared,  this  unit 
notifies  the  IUD. 
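
A minimal Python sketch of the single-instruction case just described: operand lookup, remapping on a result, and deferred freeing of the old physical register. The class name, method names, and register labels are hypothetical. For brevity the pending-request counts are kept in the same object, whereas the text places them in the unit that holds the register.

```python
class VectorStatusTable:
    """Logical-to-physical vector register map (illustrative sketch only)."""

    def __init__(self):
        self.physical_of = {}   # logical register -> (unit, location)
        self.pending = {}       # physical register -> outstanding requests
        self.retired = set()    # physical registers waiting to be freed

    def operand(self, logical):
        # Associative lookup; a missing entry is the error condition that
        # should cause a program interrupt.
        if logical not in self.physical_of:
            raise LookupError("no physical register for " + str(logical))
        phys = self.physical_of[logical]
        self.pending[phys] = self.pending.get(phys, 0) + 1
        return phys

    def result(self, logical, new_phys):
        # Remap the logical register; the old physical register may be freed
        # once all pending requests on it have cleared.
        old = self.physical_of.get(logical)
        self.physical_of[logical] = new_phys
        if old is None:
            return []
        self.retired.add(old)
        return self._try_free(old)

    def request_done(self, phys):
        self.pending[phys] -= 1
        return self._try_free(phys)

    def _try_free(self, phys):
        # Returns the list of registers freed by this event (empty or one).
        if phys in self.retired and self.pending.get(phys, 0) == 0:
            self.retired.discard(phys)
            self.pending.pop(phys, None)
            return [phys]
        return []
```

The list returned by `result` and `request_done` stands in for the notification sent to the IUD when a register is finally released.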

The problems associated with parallel processing of instructions are
more complex than those encountered in the scalar case. This is a
consequence of the fact that the physical location of a logical result is
not known until processing of that instruction is nearly complete. To
accommodate this situation, a special bit will signify a not yet known
address. In addition, the time index of the corresponding result will be
provided in place of the physical address. Special logic to fill in this
information will be described in Section 4.6.3.7. This same logic provides
this information to be added to the vector status table when the
information becomes available. In addition, we need a comparison tree
similar to that for the scalar status table, discussed in the previous
section.

4.6.2.5  Scalar  EU  Assignment 

There may be several SEUs that are functionally equivalent. We must
provide a method of selecting which SEU will be used for a given
instruction. Since the SEUs, unlike the VEUs, do not contain any operands
(see following section), the only consideration that seems reasonable to
take into account is the size of the various queues. Thus, logic will be
provided to keep track of where the next n scalar instructions should be
assigned for each set of equivalent SEUs, where n is the maximum number of
instructions that can be processed in any clock. The logic will update
this information every clock based on which SEUs were assigned in the
previous clock and based on information from the SEUs on instructions
completed.
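
As an illustration only, the following Python sketch precomputes where the next n scalar instructions would go if the sole criterion is queue size. The function name is hypothetical, and the shortest-queue-first rule with ties going to the lowest-numbered unit is an assumption; the text fixes only that queue size is the consideration.

```python
def assign_seus(queue_sizes, n):
    """Plan SEU assignments for the next n instructions, shortest queue first."""
    sizes = list(queue_sizes)       # work on a copy of the current queue sizes
    plan = []
    for _ in range(n):
        unit = min(range(len(sizes)), key=sizes.__getitem__)
        plan.append(unit)
        sizes[unit] += 1            # the planned instruction lengthens that queue
    return plan
```

For example, with queue sizes `[3, 1, 2]` and n = 4 the plan is `[1, 1, 2, 0]`. In hardware the plan would be refreshed every clock from the previous assignments and the completion reports.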

4.6.2.6  Vector  EU  Assignment 

Vector operands may reside in a specific vector execution unit and
there exists logic to use these as operands. In order to lessen the load
on the vector switch as well as to minimize transfer delays, we want to
encourage using these features. The question of what constitutes an
optimal scheduling scheme is extraordinarily complex. In addition, we have
severe constraints imposed by required rapid and comparatively cheap
hardware implementation. We will propose a scheme that seems workable and
reasonable.

We begin our discussion by considering different relations between the
number of active programs ($P_a$) and the number of equivalent EUs ($R_e$).
When possible, we will assign specific EUs to specific programs and try to
keep all computation within assigned EUs. When queue size discrepancies
become too large, we will start distributing the operands.

We first consider cases where $R_e \ge P_a$. Each program may have its own
resource or resources; i.e., if

    $n P_a \le R_e$   and   $n \ge 1$

then each program has n resources allocated to it.

First we consider the case of a single resource assigned to each
program. There are three pairs of threshold values that determine which EU
is chosen. These threshold values all represent differences in queue
sizes. The first two are limited to queues of EUs assigned to a particular
program.

$\theta_{2OI}$     Size of queue containing both operands minus size of
                   smallest queue.
$\theta_{1OI}$     Size of smallest queue containing either operand
                   minus size of smallest queue.

The next two values represent the differences of a queue size assigned to
the program minus a queue size not assigned to the program.

$\theta_{2OIE}$    Size of queue containing both operands minus size
                   of smallest queue.
$\theta_{1OIE}$    Size of smallest queue containing one operand minus
                   size of smallest queue.

The final two thresholds refer to queues not assigned to the program:

$\theta_{2OE}$     Size of queue containing both operands minus size
                   of smallest queue.
$\theta_{1OE}$     Size of smallest queue containing either operand
                   minus size of smallest queue.

The threshold values can be assigned dynamically or be hard-wired
constants. Experimentation should be conducted to determine optimal values.
Associated at any instant in time with a threshold value is the actual value
of the corresponding condition. We will label these as $\theta A$ with the
same subscript. Thus, $\theta A_{1OIE}$ is the actual difference of the
smallest queue assigned to this program branch which contains a single
operand minus the smallest queue size not assigned to this program. Both
queues are restricted to those capable of performing the operation which we
are now assigning to a queue. We will define $Q(\theta)$, where $\theta$ is
a threshold parameter, to be the index of the first queue associated with
this parameter. I.e., $Q(\theta A_{2OIE})$ is the queue with two operands.
Whenever the actual value for a given threshold is not defined, it will
behave as infinity.

Finally, we will consider the six threshold values as being ordered in
the sequence they were defined and will abbreviate threshold and actual
values as $\theta_i$ and $\theta A_i$, $i = 0, 1, \ldots, 5$. Thus,
$\theta_0$ is the same as $\theta_{2OI}$. The algorithm for queue selection
in the case where $R_e \ge P_a$ is to select $Q(\theta A_i)$ for the least
$i$ such that

    $\theta A_i < \theta_i$

If no i satisfies this condition, then the smallest queue is chosen.
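
The selection rule can be sketched in Python as follows. The function and parameter names are hypothetical, and an undefined actual value is represented by `math.inf` so that it never satisfies the comparison, which is one way to make it "behave as infinity".

```python
import math

def select_queue(thresholds, actuals, queues, smallest_queue):
    """Return Q(thetaA_i) for the least i with thetaA_i < theta_i.

    thresholds[i] -- threshold value theta_i, in the order of definition
    actuals[i]    -- actual difference thetaA_i (math.inf when undefined)
    queues[i]     -- index of the queue associated with thetaA_i
    """
    for theta, actual, queue in zip(thresholds, actuals, queues):
        if actual < theta:
            return queue
    # No threshold is satisfied: fall back to the smallest queue.
    return smallest_queue
```

With thresholds `[2, 3, 1, 2, 0, 1]` and actual values `[inf, 4, 0, inf, inf, inf]`, the first satisfied comparison is at i = 2, so the queue associated with that entry is selected.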

Two observations seem important here. First, we probably have more
threshold values than are useful and experiments would probably show us
how to limit these. Second, other threshold values are possibly
meaningful. One example would be a threshold referring to the queue size
difference of a queue assigned to the program containing one operand minus
queue size of a queue not assigned to the program containing one operand.
As stated previously, our algorithm is not necessarily optimal, only
practical and reasonable.

In the case of $R_e < P_a$, there will be no assignment of EUs to queues.
In  this  case  only  the  0-,Qr  will  be  meaningful.  Otherwise,  the  same  algo- 
rithm applies. 

In  the  case  of  multiple  resources  assigned  to  a  program,  additional 
threshold  values  are  needed  to  decide  within  these  assigned  resources  when 
small  queue  size  has  precedence  over  the  presence  of  operands.  At  this 
point,  it  seems  desirable  to  describe  a  general  theory  of  threshold  values. 
Let  there  be  K  classes  of  resources  ordered  so  that  the  first  of  these  is 
the  most  desirable  to  use  and  the  last  the  least  desirable.  Assume  that 
instructions  may  have  up  to  N  operands.  The  threshold  values  will  be 

TABLE 15   CLASS PAIRS

i              Class Pairs

1              1,1
2              1,2
3              1,3
.              .
K              1,K
K+1            2,2
K+2            2,3
.              .
2K-1           2,K
2K             3,3
.              .
K(K+1)/2       K,K

TABLE 16   OPERAND DISTRIBUTION PAIRS

j                           Operand Distribution Pairs

1                           N,0
2                           N-1,0
3                           N-1,1
4                           N-2,0
5                           N-2,1
6                           N-2,2
7                           N-3,0
8                           N-3,1
9                           N-3,2
10                          N-3,3
.                           .
(N+2)(N+1)/2 - N            0,0
(N+2)(N+1)/2 - (N-1)        0,1
.                           .
(N+2)(N+1)/2                0,N

written as $\theta_{ij}$. The actual value of the corresponding difference
in queue sizes will be $\theta A_{ij}$. $K(\theta A_{ij})$ refers to the
first element in the class pair corresponding to i. Tables 15 and 16 define
the i and j subscripts.
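
The index orderings of Tables 15 and 16 can be generated mechanically. The Python sketch below (function names hypothetical) lists the pairs in table order, so that list position plus one gives the subscript i or j.

```python
def class_pairs(K):
    """Class pairs in the order of Table 15: (1,1), (1,2), ..., (K,K)."""
    return [(a, b) for a in range(1, K + 1) for b in range(a, K + 1)]

def operand_distribution_pairs(N):
    """Operand distribution pairs in the order of Table 16: (N,0), ..., (0,N)."""
    return [(a, b) for a in range(N, -1, -1) for b in range(0, N - a + 1)]
```

The lengths of the two lists are K(K+1)/2 and (N+2)(N+1)/2 respectively, which matches the last index shown in each table.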

For this more complex case, no linear ordering of threshold values
makes sense. Instead, the algorithm for choosing the EU will be to take
$Q(\theta A_{ij})$ for the i,j pair such that:

    $\theta A_{ij} < \theta_{ij}$

where  $i = 1, 2, \ldots, K(K+1)/2$
       $j = 1, 2, \ldots, (N+2)(N+1)/2$

Again, not all the $\theta_{ij}$ are defined, and the matrix of useful
values is likely to be fairly sparse.

4.6.2.7   Generating  Vector  Switch  and  Internal  Switch  Queue  Entries 

After  we  have  selected  the  VEU,  we  must  reserve  physical  space  within 
the  VEU  for  the  operands  and  results.  We  then  need  to  generate  instructions 
for  the  various  switches  to  transfer  the  operands  to  the  VEU.  In  addition, 
we  must  reserve  space  for  the  results  and  update  the  vector  status  table 
accordingly.  Parallel  processing  of  instructions  constrains  us  to  first 
reserve  space  for  the  result.  Then  the  remainder  of  the  above  operations 
can  be  performed,  in  some  cases,  using  this  information  about  results  be- 
cause it  corresponds  to  one  of  the  operands. 

4.6.2.8   Generating  Instructions  for  the  EUs  and  Memory 

At  this  point  we  have  discussed  how  all  the  information  necessary  for 
the  EU  instructions  is  generated.  Thus  the  only  remaining  problems  are  to 
assemble  the  information  together  and  transfer  it  to  the  appropriate  EU. 
To  assemble  the  instruction,  we  will  provide  addresses  with  each  of  its 
component  parts.  Using  these  addresses  to  assemble  the  instructions  poses 
no special problems. In designing the logic for transmitting the
instructions to the EUs, we wish to minimize the size of the data paths. In
any given minor clock all the instructions emerging may be scalar
instructions (or any other single type). Since no single unit can process
instructions at this rate, it makes sense to buffer the output to the
various units.

4.6.3   Logical  Structure 

In this section we will provide a logical design for the functions
discussed in Section 4.6.2. We will do this in sufficient detail to
provide realistic estimates of the gate counts for various individual
components and for the entire unit. We will first discuss the overall
structure of the IUD pipeline and then go on to describe each of the stages
and units in detail.

4.6.3.1   IUD  Pipeline  Structure 

Before we begin our overall analysis of the pipeline structure, we
need to provide more details about OFFL. Thus in this section we will
describe the OFFL instruction format and syntax. We will use this
information to analyze the pipeline requirements. Then we will present a
time-versus-function diagram of the IUD's operation. In this section we
will describe instructions and design in considerable detail. We do this,
not because we believe the design is optimal or that there is anything
sacred about the particular design decisions we have made, but because our
approach differs radically from conventional control unit design. Thus,
details are worked out both as a discipline unto ourselves to ensure that
we are not overlooking any devastating problem, and as a means of
convincing the reader that our overall approach is workable.

4.6.3.1.1   OFFL  Instruction  Format 

OFFL instructions consist of a variable number of 16-bit bytes. The
first of these specifies the operation to be performed. It has the
following format:

Fields    Meaning

0:1       1. This bit is 1 only for the operator portion of an
          instruction.

1:1       0 indicates an SEU instruction. 1 indicates a VEU
          instruction. (All memory instructions are to or from
          the VEU.)

2:2       Contains the program number. (Up to four different
          programs can be executing simultaneously with
          instruction level multiprogramming.)

4:4       EU address. (Specifies the type of EU that this
          instruction requires.)

8:8       Information to be interpreted by the specified EU. (This
          field is used by the EU to determine what operation to
          perform. If it is identically 0, then the next byte
          contains information to be passed on to the EU.)




The  EU  operation  field  or  literal  information  for  the  EU  can  be  extended 
over  up  to  four  additional  bytes.  Bit  0  of  each  of  these  bytes  contains 
a  1.  The  remainder  of  the  byte  contains  information  for  the  EU. 
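
A short Python sketch of how the first operator byte might be unpacked. Several points are assumptions rather than statements from the text: field entries are read as start:length, bit 0 is taken to be the most significant bit of the 16-bit byte, the last field is taken to occupy the low 8 bits (bits 8 to 15), and the function and key names are hypothetical.

```python
def field(word, start, length, width=16):
    """Extract `length` bits beginning at bit `start`, counting bit 0 as the MSB."""
    shift = width - start - length
    return (word >> shift) & ((1 << length) - 1)

def decode_operator(word):
    """Split the first 16-bit byte of an OFFL instruction into its fields."""
    return {
        "is_operator": field(word, 0, 1),   # 1 only for operator bytes
        "is_veu": field(word, 1, 1),        # 0 = SEU instruction, 1 = VEU
        "program": field(word, 2, 2),       # program number, 0 to 3
        "eu_address": field(word, 4, 4),    # type of EU required
        "eu_info": field(word, 8, 8),       # passed to the EU; 0 = see next byte
    }
```

For instance, `decode_operator(0xE50F)` yields an operator byte for a VEU instruction of program 2, EU address 5, with EU information 15.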

The  following  is  the  format  of  all  possible  OFFL  operands  and  results:

Fields    Meaning 

0:1      0 

1:1      Indicates  that  this  is  a  result  and  also  signifies  the 
physical  end  of  the  instruction.  In  the  case  of  a 
memory  result,  the  physical  end  of  the  instruction  occurs 
not  at  this  byte,  but  at  the  next. 

2:1      0  indicates  a  scalar.  1  indicates  a  vector. 

3:1      Vector  only.  0  indicates  a  logical  vector  buffer  address; 
1  indicates  a  main  memory  address. 

3:13     Scalar  only.  Physical  scalar  buffer  address. 

4:12     Vector  only.  In  the  case  of  a  memory  address  (bit  2), 
this  field  contains  the  physical  location  within  the 
physical  memory  box  specified  by  the  next  byte.  In  the 
case  of  a  logical  vector  buffer  address  (bit  2),  this 
field  contains  that  address. 

As  mentioned  in  the  above  table,  memory  results  are  two  bytes  long.  The 
second  byte,  except  for  bit  0  which  is  0,  is  completely  used  to  specify  a 
memory  box.  This  completes  the  specification  of  the  format  of  OFFL. 

4.6.3.1.2   OFFL  Syntax 

In  this  section  we  specify  the  syntax  of  OFFL  instructions  in  BNF 
using  the  metalanguage  of  the  report  on  ALGOL  60.  We  will  describe  the 
semantics  of  the  terminal  and  non-terminal  symbols  used  in  terms  of  the 
previous  section  and  then  present  the  brief  formal  syntax.  Note  that  if 
there  are  two  scalar  operands,  the  second  will  be  a  mode  pattern.  If  there 
is  only  one  scalar  operand,  then  the  operation  will  specify  whether  it  is 
a  mode  or  an  index.  Table  17  is  the  syntax  of  OFFL. 

4.6.3.1.3   Analysis  of  Pipeline  Requirements 

In  Section  4.6.2.1  we  determined  that  the  IUD  must  allow  for  the 
emergence  of  instructions  at  a  rate  of  approximately  four  instructions  per 
minor  clock.  In  this  section  we  will  analyze  how  that  requirement,  in  con- 
junction with  the  specification  of  OFFL  we  have  described  in  the  previous 
two  sections,  translates  into  physical  requirements  of  the  IUD  structure. 
Because  of  the  extreme  variability  of  instruction  length  and  complexity,  it 
would  be  unnecessarily  costly  to  allow  the  IUD  to  be  able  to  process  any 
possible  combination  of  instructions  at  an  emergence  rate  of  four  per  clock. 
As  a  first  step  in  determining  a  reasonable  design,  we  will  enumerate  rele- 
vant constraints  on  OFFL  instructions.  These  constraints  can  all  be  easily 
derived  from  the  previous  two  sections.  They  are  listed  in  Table  18. 

TABLE 17   OFFL SYNTAX

Symbol           Meaning

Instruction      A complete OFFL instruction from operation to result.
Memory           A complete OFFL instruction which makes reference to
                 main memory.
Not Memory       A complete OFFL instruction which does not make reference
                 to main memory.
NM Operator      An operation of one to five bytes as specified in the
                 previous section that does not refer to main memory.
NM Operand       All the operands in a complete non-memory OFFL instruction.
NM Result        A vector buffer or scalar buffer result.
M Operator       An operation of one to five bytes as specified in the
                 previous section that does refer to main memory.
M Operand        An operand which specifies a main memory address, including
                 indexing and mode specification.
V Result         A result which specifies a logical vector buffer address.
V Operand        An operand which specifies a logical vector buffer address.
M Result         A result which specifies a main memory address, including
                 indexing and mode specification.
VS Operand       An operand giving a vector buffer or scalar buffer address.
S Operand        An operand specifying a physical scalar buffer address.
S Result         A result specifying a physical scalar buffer address.
OM Address       A two-byte operand which specifies a memory box and an
                 address within the box.
RM Address       A two-byte result which specifies a memory box and an
                 address within the box.
Index and Mode   A string containing 0 or 1 vector operands and 0 to 2
                 scalar operands which serve as indexes and modes relative
                 to the physical memory address specified by the associated
                 memory address. All indexing is limited to addresses
                 within a single memory box.

The OFFL syntax follows.

Instruction ::= Memory | Not Memory;
Not Memory ::= NM Operator NM Operand NM Result;
Memory ::= M Operator M Operand V Result | M Operator V Operand M Result;
NM Operand ::= VS Operand | VS Operand VS Operand;
VS Operand ::= V Operand | S Operand;
NM Result ::= V Result | S Result;
M Operand ::= Index and Mode OM Address;
M Result ::= Index and Mode RM Address;
Index and Mode ::= S Operand | V Operand | V Operand | S Operand |
                   V Operand S Operand S Operand;

TABLE 18   OFFL INSTRUCTION CONSTRAINTS

Instruction Parameter                              Minimum   Maximum

Instruction length in bytes                           3         12
Number of scalar operands                             0          2
Number of vector operands                             0          2
Number of memory operands                             0          1
Number of results, any type                           1          1
Operator length in bytes                              1          5
Memory operand or result length in bytes              2          5
Total length of all operands for a
non-memory instruction in bytes

The  physical  paths  between  the  MIDs  and  the  IUD  must  be  of  some  fixed 
size.  At  this  point  we  need  to  translate  an  emergence  rate  of  four  instruc- 
tions per  clock  into  a  size  for  these  paths.  Since  the  path  size  determined 
will  be  a  fundamental  physical  limit  to  the  IUD's  processing  rate,  we  will 
design  the  IUD  to  handle  the  full  bandwidth  of  these  paths.  The  tradeoff 
in  deciding  on  this  parameter  is  the  possibility  of  the  IUD  slowing  down  the 
EUs  versus  the  cost  of  the  IUD.  Because  of  its  pipelined  parallel  nature 
and  the  sophisticated  functions  it  must  perform  as  outlined  in  Section  4.6.2, 
the  cost  of  the  IUD  rises  dramatically  with  increased  bandwidth.  This  will 
become  even  more  evident  in  the  remainder  of  Section  4.6.3  as  we  do  detailed 
logical  design.  The  emergence  rate  of  instructions  required  is  actually 
3  1/2,  not  4,  and  the  assumptions  that  gave  rise  to  that  figure  were  some- 
what conservative.   (See  Section  4.6.2.1  for  details.)  Thus,  a  bandwidth 
of  12  bytes  per  clock  total  coming  into  the  IUD  seems  to  be  a  reasonable 
figure  that  will  allow  for  little  or  no  delay  in  the  EUs.  There  are  addi- 
tional reasons  for  choosing  the  precise  figure  12.  These  will  become  appa- 
rent in  the  remainder  of  this  and  the  following  section. 

The IUD is required to perform various types of operations in parallel
as outlined in Section 4.6.2. We will now determine what degree of
parallelism will be required. As a first step, we will determine how many
of the various types of operations, operands, and results may occur in
segments of the instruction stream of different lengths. Table 19 gives
maximum counts versus instruction stream length and can be easily derived
from the instruction constraints listed above. This table gives the
maximum number of instruction components that may occur in a length of
instruction stream segment. Note that 12 is a particularly good number
since at 13 most of the counts go up by 1. This table will also be used in
the next section.
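
The remark about 12 versus 13 can be checked directly against the Table 19 rows for stream lengths 11, 12, and 13. In the Python sketch below the column names are paraphrases chosen for illustration (the "Size" columns are assumed to be in bytes) and the function name is hypothetical; the numbers are copied from the table.

```python
COLUMNS = ("operators", "operator bytes", "memory items", "memory bytes",
           "operands", "operand bytes", "results", "result bytes")

# Maximum counts from Table 19, keyed by instruction stream length.
MAX_COUNTS = {
    11: (4, 9, 4, 6, 6, 6, 4, 4),
    12: (4, 10, 4, 7, 6, 6, 4, 4),
    13: (5, 10, 4, 8, 7, 7, 5, 5),
}

def columns_that_grow(shorter, longer):
    """List the columns whose maximum count rises between two stream lengths."""
    pairs = zip(COLUMNS, MAX_COUNTS[shorter], MAX_COUNTS[longer])
    return [name for name, a, b in pairs if b > a]
```

Going from 12 to 13 bytes raises six of the eight columns, including every count that sets a number of processing units, whereas going from 11 to 12 raises only the two size columns for operators and memory items.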

TABLE 19   INSTRUCTION STREAM CONSTRAINTS

Instruction                  Memory Operands   Vector and Scalar
Stream        Operators      and Results       Operands*           Results*
Length                       Combined
              Number  Size   Number  Size      Number  Size        Number  Size

1             1       1      1       1         1       1           1       1
2             1       2      1       2         2       2           1       1
3             1       3      2       2         2       2           1       1
4             2       4      2       3         2       2           2       2
5             2       5      2       4         3       3           2       2
6             2       5      2       4         4       4           2       2
7             3       5      3       4         4       4           3       3
8             3       6      3       5         4       4           3       3
9             3       7      3       6         5       5           3       3
10            4       8      3       6         6       6           4       4
11            4       9      4       6         6       6           4       4
12            4       10     4       7         6       6           4       4
13            5       10     4       8         7       7           5       5
14            5       10     4       8         8       8           5       5
15            5       11     5       8         8       8           5       5
16            6       12     5       9         8       8           6       6

*These columns refer to either of the specified types and not to both
types combined.

4.6.3.1.4   Switching  Instruction  Components  into  the  Pipe 

The  first  stages  of  the  IUD  pipe  will  consist  of  units  designed  to 
process  each  of  the  various  instruction  components.  Since  these  components 
may  occur  at  any  point  in  each  12-byte  segment,  we  need  some  means  of  trans- 
mitting the  various  components  to  the  appropriate  type  of  processors.  At 
the  same  time  we  must  maintain  the  identity  and  sequence  of  the  original 
instructions.  We  will  number  the  instructions  0  through  3  and  assign  this 
index  to  each  instruction  component.  Either  or  both  instructions  0  and  3 
may  be  incomplete.  Thus,  in  the  later  stages  of  the  pipe  we  must  take  this 
into  account.  Table  21  gives  the  logic  equations  for  assigning  instruction 
numbers  and  an  associated  gate  count. 
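
The numbering can be illustrated with a small Python sketch. It assumes that the instruction number attached to a byte is simply the number of instruction starts seen in byte positions B through that byte, so that whatever occupies position A belongs to instruction 0. The function name is hypothetical, and the sketch does not model any special handling of the case where four instructions start after position A.

```python
def instruction_numbers(starts):
    """Assign an instruction number to each byte of a 12-byte segment.

    starts[k] is True when byte position k (A = 0, ..., L = 11) holds the
    first byte of a new instruction.
    """
    numbers = []
    count = 0
    for position, is_start in enumerate(starts):
        # A start in position A still belongs to instruction 0.
        if position > 0 and is_start:
            count += 1
        numbers.append(count)
    return numbers
```

With starts at positions B, E, and H, the bytes are numbered 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3. The hardware computes the same two-bit values combinationally rather than by a sequential count.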

The  component  pipes  that  our  12  instruction  bytes  may  need  to  be 
switched  into  are  listed  below. 

TABLE 20   PIPE COMPONENTS

Component                        Number of Units   Size of Unit in Bytes

Operator                                4                    5
Memory Operands and Results             4                    2
Vector Operand                          6                    1
Vector Result                           4                    1
Scalar Operand                          6                    1
Scalar Result                           4                    1

TABLE 21   LOGIC EQUATIONS AND GATE COUNTS FOR
ASSIGNING INSTRUCTION NUMBERS

The symbols A-L represent a logical input as to whether the corresponding
instruction byte is the first byte of a new instruction. This can be
determined by bit 0 of the preceding instruction being 0 and bit 1 of this
byte being 1. Q, R, and S are the following logical functions:

Q  =  B  E  H  K 
R  =  I  V  J  V  K 
S  =  I  V  J 

X,  Y,  T,  and  W  are  instruction  bits  which  are  used  in  other  equations. 


      High-order Bit of     Low-order Bit of      Number
      Instruction Number    Instruction Number    of Gates

A     0                     0                     0
B     0                     B                     1
C     0                     B V C                 2
D     0                     B V C V D             3

E  X  =  BE  Y  =  'v(BEVBCD)  7 

F  X  V  Y  F  YFVYF  7 

G  X  V  Y  F  V  Y  G  YFGVYFVYG  12 

H  T  =  XVYFVYGVYH    W  =  Y  F  G  H  V  Y  F  V  Y  G  V  Y  H  18 

I  TVWI  WlVWI  7 

J  TVWIVWJ  WTJVWIJVWTJ  12 

K  QTVQWIVQWJVQWK   WTJT^V¥lV¥JV¥QK  23 

L  QTVTSVTKQVTL  W  R  L  V  W  R  L  V  W  R  L  18 


TOTAL     110 

With fan out of 20 and fan in of 4, the entire decoding can be implemented
in 7 levels of logic. The gate counts in this case would be 110 plus 9
for Q, R, and S, plus 24 to generate the initial values A-L. As a
practical matter, at least one additional level of logic and a slightly
higher gate count are likely to be required to avoid the large fan out. It
may be possible to implement the decoder in one minor clock with under 200
gates, but two minor clocks may be required.

TABLE 22   LOGIC EQUATIONS FOR THE CONTROL OF
THE IUD FRONT END SWITCHES


Variables A - E represent logical values associated with byte positions.
They are true if the corresponding byte is of the type this switch fetches.

Variables xG - xK (where x is one of A - E) represent enables of switch
paths. In particular, AG equal true enables the path from byte position A
to the first pipe entry of the type the switch is for. Similarly, BH
enables the path from the second byte position to the second pipe entry.
The designs will not be for fully general switches. They will take
advantage of restrictions from Table 19.


I.  Control  for  4x2  switch  used  by  vector  and  scalar  operands. 

AG  =  A  DH  =  D 

BG  =  AB  CH  =  DC 

CG  =  ABCD  BH  =  DCAB 

(Note:  Paths  AH  and  DG  are  not  required,  saving  both  switch  and 
control  logic.) 

II.  Control  for  3  x  1  switch  used  by  vector  and  scalar  results. 

AG  =  A 
BG  =  B 
CG  =  C 

III.  Control  for  6x4  selector  used  for  memory  operands  and  results 
combined. 

AG  =  A  FJ  =  F 

BG  =  AB  EJ  =  FE 

CG  =  ABC  DJ  =  FED 

BH  =  AB  EI  =  FE 

CH  =  ABC  v  ABC                         DI  =  FED  v  FED 


IV.  Control  for  6x5  selector  used  for  operators. 
AG  =  A  FK  =  F 

BG  =  AB  EK  =  FE 

BH  =  AB  EJ  =  FE 

CH  =  ABC  v  ABC  DJ  =  "FED  v  FFD 

CI  =  ABC  CI  =  FED 

A full cross-bar switch for this transfer would need to be 12 bytes
by 56 bytes. The logic to control this switch would be particularly
cumbersome. Referring to the list of components versus instruction length
in the previous section, we see that there is a possibility of partitioning
the 12 instruction bytes into 4 groups of 3 each in the case of vector and
scalar results. In the case of vector and scalar operands, 3 partitions of
length 4 make sense. In the case of memory operands or results and in the
case of operators, two partitions of length 6 will work. All these
partitions have the advantage of not requiring any additional processing
units, while at the same time reducing the complexity of the switch and its
control.

With  these  smaller  partitions,  we  can  design  very   simple  and  fast  con- 
trol logic  for  the  switches.  The  basic  idea  of  the  design  is  to  start  at 
both  ends  and  work  towards  the  middle.  Thus,  if  we  are  checking  4  bytes, 
2  of  which  may  be  of  the  same  type,  then  the  first  output  path  accessed  by 
this  switch  will,  in  a  sense,  be  assigned  to  the  first  two  bytes,  and  the 
other  output  path  to  the  remaining  bytes.  The  logic  will  be  symmetric 
around the middle. Detailed logical design of control for all the partitions
required is contained in Table 22.
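The "start at both ends and work towards the middle" rule can be sketched behaviorally as follows. This is only an illustration under stated assumptions: the function name `switch_control` and its list-based interface are hypothetical, the sketch covers only partitions with an even number of bytes and an even number of output paths (such as the 4x2 and 6x4 cases), and it models the priority pattern rather than the gate-level equations of Table 22.

```python
def switch_control(flags, n_outputs):
    """Return, for each output path, the index of the byte routed to it (None if unused).

    flags[i] is True when byte i holds a component of the type this switch handles.
    Hypothetical helper: names and interface are not from the original design.
    """
    n = len(flags)
    if n % 2 or n_outputs % 2:
        raise ValueError("this sketch covers only even, symmetric partitions")
    half, per_side = n // 2, n_outputs // 2

    # Scan inward from the left end for the left-hand output paths.
    left = [i for i in range(half) if flags[i]][:per_side]
    # Scan inward from the right end for the right-hand output paths.
    right = [i for i in range(n - 1, half - 1, -1) if flags[i]][:per_side]

    # Unused output paths are reported as None.
    left += [None] * (per_side - len(left))
    right += [None] * (per_side - len(right))
    # The outermost right-hand path is listed last, mirroring the left side.
    return left + right[::-1]


# 4x2 partition: one output serves the first two bytes, the other the last two.
print(switch_control([False, True, True, False], 2))
# 6x4 partition with components in bytes A, C, E and F (indexes 0, 2, 4, 5).
print(switch_control([True, False, True, False, True, True], 4))
```

Because each side is resolved independently, no output path ever has to look past the middle of the partition, which is what keeps the control logic shallow.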

At  this  point  we  can  complete  the  design  of  the  front  end  of  the  IUD 
pipe.  There  is  a  12-byte  wide  data  path  into  the  IUD  which  inputs  a  new 
segment of the instruction stream. There are registers to receive this input
and a second set of registers to serve as a buffer if the IUD becomes
blocked.  It  takes  longer  than  1  minor  clock  to  notify  the  MIDs  that  there 


[FIGURE 17   IUD FRONT END (block diagram). Blocks: input FROM MID; INSTRUCTION INDEX GENERATOR; BLOCKAGE BUFFER; SWITCH CONTROLLERS; output TO SWITCHES. Switches: 6x4 MEMORY OPERANDS RESULTS; 6x5 OPERATIONS; 4x2 VECTOR OPERANDS; 4x2 SCALAR OPERANDS; 3x1 VECTOR RESULTS; 3x1 SCALAR RESULTS.]


TABLE 23   GATE COUNT FOR IUD FRONT END

Unit                          How Derived                                     Gate Count   Number Units   Total

16 bit x 12 byte buffer       4 gates/bit                                            768              2    1536
19 bit x 12 byte buffer       4 gates/bit                                            912              2    1824
Instruction index generator   Table                                                  200              1     200

Switch Controllers (SK) and Switches (S)

4 x 2 S                       (no. of bits) * (lines in) * (lines out) * 4           608              6    3648
3 x 1 S                       same as above                                          228              8    1824
6 x 4 S                       same as above                                         1536              2    3072
6 x 5 S                       same as above                                         2280              2    4560
4 x 2 SK                      Table (number of variables in equation times 2)         28              6     168
3 x 1 SK                      same as above                                            6              8      48
6 x 4 SK                      same as above                                           56              2     112
6 x 5 SK                      same as above                                           56              2     112

TOTAL                                                                                                     17104
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is  a  block.  However,  all  but  one  minor  clock  of  this  delay  is  buffered  by 
the  path  transducer  discussed  in  Section  3.1.3.  The  following  functions 
are  performed  by  the  IUD  front  end. 

(1) Instruction indexes are generated for each instruction byte.

(2)  One  clock  later,  control  signals  for  the  various  partitioned 
switches  are  generated. 

(3)  One  clock  later,  the  instructions  meet  their  indexes. 

(4)  One  clock  later,  the  instruction  components  are  switched  into 
the  various  pipes. 

Figure 17 gives the overall structure of the IUD front end. Table 23 provides
a gate count of the IUD front end.
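As a cross-check, the totals of Table 23 can be recomputed from its derivation rules (4 gates per buffer bit; bits times lines in times lines out times 4 for a switch). The sketch below is illustrative only. The helper names are hypothetical, and the switch widths of 19 and 16 bits are an assumption inferred from the tabulated gate counts rather than stated in the table.

```python
def buffer_gates(bits, n_bytes, gates_per_bit=4):
    """Gate count of a register buffer: bits per byte position * positions * gates/bit."""
    return bits * n_bytes * gates_per_bit


def switch_gates(bits, lines_in, lines_out):
    """Gate count of a switch: (no. of bits) * (lines in) * (lines out) * 4."""
    return bits * lines_in * lines_out * 4


# (unit, gates per unit, number of units); controller counts are taken from the table.
rows = [
    ("16 bit x 12 byte buffer", buffer_gates(16, 12), 2),
    ("19 bit x 12 byte buffer", buffer_gates(19, 12), 2),
    ("instruction index generator", 200, 1),
    ("4 x 2 S", switch_gates(19, 4, 2), 6),   # width 19 is inferred, not stated
    ("3 x 1 S", switch_gates(19, 3, 1), 8),
    ("6 x 4 S", switch_gates(16, 6, 4), 2),   # width 16 is inferred, not stated
    ("6 x 5 S", switch_gates(19, 6, 5), 2),
    ("4 x 2 SK", 28, 6),
    ("3 x 1 SK", 6, 8),
    ("6 x 4 SK", 56, 2),
    ("6 x 5 SK", 56, 2),
]

total = sum(gates * units for _, gates, units in rows)
for name, gates, units in rows:
    print(f"{name:30s} {gates:5d} x {units} = {gates * units}")
print("TOTAL", total)
```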

4.6.3.1.5   Global  Structure  of  IUD  Pipe 

We  have  just  seen  how  the  stream  of  IUD  instructions  is  broken  up  into 
bytes  of  various  types  by  the  IUD  front  end.  This  breaking  up  is  done  in 
such  a  way  that  the  identity  of  the  complete  instruction  can  be  recovered 
later.  In  this  section  we  will  describe  the  overall  structure  and  timing 
of  the  IUD  pipe  as  it  performs  the  functions  outlined  in  Section  4.6.2.  In 
the  remainder  of  Section  4.6.3  we  will  provide  detailed  logical  design  of 
the  various  components  of  the  pipe  as  well  as  gate  counts. 

Table  24  gives  a  list  of  functions  to  be  performed  (including  how  many 
parallel units are required), the delays involved, and the dependency relationships.
This table is then used to generate Table 25, which describes
the  timing  sequence  for  all  functional  components  in  the  IUD  pipe.  Finally, 
using  these  two  tables,  we  construct  Figure  18  which  is  an  overall  diagram 
of  the  pipeline  components. 
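The step from Table 24 to Table 25 amounts to scheduling each function as soon as the function it depends on has finished. A minimal sketch of that derivation follows. The dictionary layout and the function `schedule` are hypothetical; the times and dependencies are those listed in Table 24, except that the four transfer-initiation functions (ISW, IM, ISC, IV) are assumed to take one clock, which is what Table 25 implies.

```python
# name: (time in minor clocks, function whose output is required or None)
TABLE_24 = {
    "SPU": (1, None), "VPU": (1, None), "SST": (2, "SPU"), "SVT": (2, "VPU"),
    "SSE": (1, None), "SVE": (2, "SVT"), "RVS": (1, "SVE"), "US": (2, None),
    "UV": (2, "RVS"), "GSI": (2, "FVO"), "FVO": (2, "RVS"), "AM": (2, "FVO"),
    "AS": (2, "FVO"), "AV": (2, "FVO"),
    # Assumed one-clock functions (duration inferred from Table 25).
    "ISW": (1, "GSI"), "IM": (1, "AM"), "ISC": (1, "AS"), "IV": (1, "AV"),
}


def schedule(table, first_stage=1):
    """Return (begin, done): the pipe stage at which each function may begin and completes."""
    begin, done = {}, {}

    def finish(name):
        if name not in done:
            time, requires = table[name]
            # A function may begin at the stage where its prerequisite has just completed.
            begin[name] = first_stage if requires is None else finish(requires)
            done[name] = begin[name] + time
        return done[name]

    for name in table:
        finish(name)
    return begin, done


begin, done = schedule(TABLE_24)
for stage in range(1, 13):
    completed = [f for f in TABLE_24 if done[f] == stage]
    may_begin = [f for f in TABLE_24 if begin[f] == stage]
    print(stage, "completed:", completed, "may begin:", may_begin)
```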
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TABLE  24   IUD  PIPE  FUNCTIONS,  TIMINGS,  AND  DEPENDENCIES 


                                                                       Section                          Number of   Requires
                                                                       Where      Abbre-                Parallel    Output
Function                                                               Outlined   viation     Time      Units       From

Use scalar result to update parallel search portion of scalar table    4.6.2.3    SPU         1         4           none
Use vector result to update parallel search portion of vector table    4.6.2.4    VPU         1         4           none
Use scalar result to search scalar table                               4.6.2.3    SST         2         6           SPU
Use vector operand to search vector table                              4.6.2.4    SVT         2         6           VPU
Select scalar execution unit                                           4.6.2.5    SSE         1         4           none
Select vector execution unit                                           4.6.2.6    SVE         2         4           SVT
Reserve vector buffer storage                                          4.6.2.7    RVS         1         6           SVE
Update scalar operand table                                            4.6.2.3    US          2         6           none
Update vector operand table                                            4.6.2.4    UV          2         6           RVS
Generate vector switch operations                                      4.6.2.7    GSI         2         9           FVO
Fill in vector operand fields not known at SVT                         4.6.2.7    FVO         2         6           RVS
Assemble complete memory instructions                                  4.6.2.8    AM          2         3           FVO
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TABLE  24   IUD  PIPE  FUNCTIONS,  TIMINGS,  AND  DEPENDENCIES  (cont.) 


                                                                       Section                          Number of   Requires
                                                                       Where      Abbre-                Parallel    Output
Function                                                               Outlined   viation     Time      Units       From

Assemble complete scalar instructions                                  4.6.2.8    AS          2                     FVO
Assemble complete vector instructions                                  4.6.2.8    AV          2                     FVO
Initiate buffered transfer of vector switch instructions               4.6.2.8    ISW                               GSI
Initiate buffered transfer of memory instructions                      4.6.2.8    IM                                AM
Initiate buffered transfer of scalar instructions                      4.6.2.8    ISC                               AS
Initiate buffered transfer of vector instructions                      4.6.2.8    IV                                AV
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TABLE  25   IUD  PIPE  TIMING  CHART 


Stage of Pipe   Functions Just Completed   Functions Which May Begin

 1                                         SPU  VPU  SSE  US
 2              SPU  VPU  SSE              SST  SVT
 3              US
 4              SST  SVT                   SVE
 5
 6              SVE                        RVS
 7              RVS                        UV   FVO
 8
 9              UV   FVO                   AM   AS   AV   GSI
10
11              AM   AS   AV   GSI         IM   ISC  IV   ISW
12              IM   ISC  IV   ISW
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Entry  Port 
from IUD           Number     Number       Functions vs Pipe Stage
Front End          of Ports   Bytes/Port   1         2     3     4      5     6     7

Operator           2          5            SSE       H     H     *SVE   H     H     H
Memory Operands    2          4            H         H     H     H      H     H     H
  and Results
Vector Operands    3          2            H         SVT   H     *SVE   H     RVS   GSI, FVO
Scalar Operands    3          2            H         SST   H     H      H     H     H
Vector Results     4          1            VPU       H     H     H      H     RVS   UV, GSI
Scalar Results     4          1            US, SPU   H     H     H      H     H     H

H  means  hold  and  pass  on  to  the  next  stage  with  no  function  initiated. 

*SVE indicates that SVE requires both these inputs and not that SVE is
done for each as is the case in other duplications down a column.

No functions are initiated at stage 8. At stage 9, we initiate another
set of switches like that in the IUD front end in order to assemble complete
instructions. We will describe this tail end of the pipe in Section 4.6.3.3.


FIGURE  18   IUD  PIPE  OVERALL  STRUCTURE 
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4.6.3.2   Detailed  Structure  and  Gate  Counts  for  Internal  IUD  Pipe 
Functions 
In this section we will present gate counts and, where necessary, logical
design for the functions listed in Table 24 up to the point where we
begin  assembling  complete  instructions.  We  will  do  logical  design  to  the 
minimum  degree  of  detail  required  to  obtain  reasonably  accurate  gate  counts. 
Many  of  the  units  we  discuss  will  be  involved  with  several  functions.  We 
will  introduce  each  unit  as  required.  At  the  end  of  this  section  we  will 
provide  a  summary  of  these  units  and  their  interconnections  as  well  as  a 
total  gate  count  for  the  internal  IUD  pipe. 

4.6.3.2.1   Details  of  the  Parallel  Update  of  the  Scalar  Table  (SPU) 

As  discussed  in  Section  4.6.2.3,  there  must  be  a  comparison  tree  for 
assigning  time  indexes  to  scalar  operands  which  may  coincide  with  scalar 
results being processed in parallel with the operands. From Table 25 we
see that the full update of the scalar table (US) is complete at clock
3. Since the search of the scalar table (SST) does not begin until clock
2,  the  comparison  tree  we  are  designing  need  only  contain  the  scalar  results 
being  processed  in  one  clock.  This  is  a  maximum  of  4  (see  Table  19).  The 
information  contained  in  this  comparison  tree  must  include  the  physical 
address  of  the  scalar  result  and  the  time  index  for  the  instruction.  In 
the  next  section  we  will  discuss  the  hardware  for  generating  time  indexes 
for  the  various  classes  of  instructions.  For  now  we  will  assume  they  are 
directly  accessible  by  using  the  instruction  indexes  discussed  in  Section 
4.6.3.1.4.   The  loading  of  this  comparison  tree  consists  of  gating  the 
required  information  into  the  associated  registers  and  clearing  any  of  the 
4  registers  not  used.  This  is  the  SPU  function.  This  comparison  tree  is 
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a  -  d 

Match Inputs

A - D

Switch Control Outputs

N

No Match Output

$N = \bar{a}\,\bar{b}\,\bar{c}\,\bar{d}$

$A = a\,\bar{b}\,\bar{c}\,\bar{d}$

$B = b\,\bar{c}\,\bar{d}$

$C = c\,\bar{d}$

$D = d$

FIGURE  19   COMPARISON  TREE  (cont.) 
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TABLE  26   SCALAR  COMPARISON  TREE  GATE  COUNT 


Unit 

Compare  for  Identical 
Physical  Addresses 
(13  bits) 

Last  Match  Selector 

4  x  1  Switch 
(9  bits) 


Gate  Count 

30  +  15  +  4  =  49 
(3  levels  of 
logic) 

14 

3  *  9  *  4  =  108 


Number/   Number/ 
Operand   Result   Total 


1 


1176 

84 
648 


TOTAL 


1908 
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then  used  by  the  SST  function  in  the  next  clock.  It  must  compare  in  paral- 
lel up  to  6  scalar  operand  physical  addresses.  In  the  case  of  a  match,  it 
must select the largest time index less than the time index of the instruction
containing the operand. Since the actual time indexes can be obtained
from the instruction index, and since the instruction indexes are in the
same sequence as the time indexes, the tree need only contain and work with
the 2-bit instruction indexes. Figure 19 gives the structure of the comparison
tree. Table 26 gives the gate count. No additional registers are
required for this unit since all required information is in registers in
other units or at various stages in the pipe. It should be possible to perform
the entire operation in six levels of logic and under one clock. Thus,
any  internal  pipelining  of  this  unit  is  not  required.  The  SPU  function  thus 
becomes  almost  null.  It  is  moved  up  one  clock  in  the  pipe  and  exists  because 
scalar  results  are  used  in  the  SST  function. 
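The search performed against this tree can be sketched as a compare stage followed by a four-input last match selector. The sketch is behavioral and rests on assumptions: the function names and the tuple layout of a tree slot are hypothetical, and the selector equations assume that the complement bars lost from Figure 19 give priority to the latest matching input.

```python
def last_match_selector(a, b, c, d):
    """Four-input last match selector: one-hot outputs (A, B, C, D) and no-match N."""
    D = d
    C = c and not d
    B = b and not (c or d)
    A = a and not (b or c or d)
    N = not (a or b or c or d)
    return (A, B, C, D), N


def search_tree(tree, operand_address, operand_index):
    """Find the latest earlier scalar result that has the operand's physical address.

    tree holds four slots in instruction order; each slot is None (cleared) or a
    (physical_address, instruction_index) pair. Returns the 2-bit instruction
    index of the selected result, or None when nothing matches.
    """
    matches = [
        slot is not None and slot[0] == operand_address and slot[1] < operand_index
        for slot in tree
    ]
    select, no_match = last_match_selector(*matches)
    if no_match:
        return None
    return next(slot[1] for slot, chosen in zip(tree, select) if chosen)


tree = [(0x10, 0), (0x20, 1), (0x10, 2), None]
print(search_tree(tree, 0x10, 3))  # latest earlier store to address 0x10
print(search_tree(tree, 0x30, 3))  # no result in the tree uses address 0x30
```

Working with the 2-bit instruction indexes keeps the comparison narrow; the full time index is looked up afterwards from the selected instruction index.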

4.6.3.2.2   Generating  Time  Indexes 

The  need  for  generating  instruction  time  indexes  arises  from  two 
sources.  First,  scalar  physical  addresses  require  a  time  index  to  uniquely 
identify them. This was discussed in detail in Section 4.3. The second requirement
for time indexes is contained entirely within the IUD. It is not
until  clock  7  that  all  the  information  necessary  to  update  the  vector  table 
is  available.  All  vector  operands  corresponding  to  results  during  this 
period  and  the  additional  two  clocks  required  to  update  the  vector  table 
will  not  contain  correct  vector  addresses.  In  fact,  their  address  field 
will  be  set  to  zero  to  indicate  that  they  still  must  be  updated.  Thus,  time 
indexes  must  be  assigned  to  vector  store  instructions  to  guarantee  that  the 
correct  vector  operand  addresses  may  ultimately  be  obtained.  To  save 
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SWITCH  CONTROL  (1) 

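i = First instruction is of specified type and is not
a continuation of a previous instruction

j = Second instruction is of specified type

k = Third instruction is of specified type

l = Fourth instruction is of specified type

no change $= \bar{i}\,\bar{j}\,\bar{k}\,\bar{l}$

$A = i\,\bar{j}\,\bar{k}\,\bar{l} \lor \bar{i}\,j\,\bar{k}\,\bar{l} \lor \bar{i}\,\bar{j}\,k\,\bar{l} \lor \bar{i}\,\bar{j}\,\bar{k}\,l$

$B = i\,j\,\bar{k}\,\bar{l} \lor i\,\bar{j}\,k\,\bar{l} \lor i\,\bar{j}\,\bar{k}\,l \lor \bar{i}\,j\,k\,\bar{l} \lor \bar{i}\,j\,\bar{k}\,l \lor \bar{i}\,\bar{j}\,k\,l$

$C = i\,j\,k\,\bar{l} \lor i\,j\,\bar{k}\,l \lor i\,\bar{j}\,k\,l \lor \bar{i}\,j\,k\,l$

$D = i\,j\,k\,l$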


FIGURE  20   TIME  INDEX  LOGIC  (cont.) 
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SWITCH  CONTROL  (2) 

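i, j, k, and l are the same as in Switch Control (1)

$SA = \bar{i}$

$SB = \bar{i}\,\bar{j}$

$SC = \bar{i}\,\bar{j}\,\bar{k}$

$SD = \bar{i}\,\bar{j}\,\bar{k}\,\bar{l}$

$AW = i$

$AX = i\,\bar{j} \lor \bar{i}\,j$

$AY = i\,\bar{j}\,\bar{k} \lor \bar{i}\,j\,\bar{k} \lor \bar{i}\,\bar{j}\,k$

$AZ = i\,\bar{j}\,\bar{k}\,\bar{l} \lor \bar{i}\,j\,\bar{k}\,\bar{l} \lor \bar{i}\,\bar{j}\,k\,\bar{l} \lor \bar{i}\,\bar{j}\,\bar{k}\,l$

$BX = i\,j$

$BY = i\,j\,\bar{k} \lor i\,\bar{j}\,k \lor \bar{i}\,j\,k$

$BZ = i\,j\,\bar{k}\,\bar{l} \lor i\,\bar{j}\,k\,\bar{l} \lor i\,\bar{j}\,\bar{k}\,l \lor \bar{i}\,j\,k\,\bar{l} \lor \bar{i}\,j\,\bar{k}\,l \lor \bar{i}\,\bar{j}\,k\,l$

$CY = i\,j\,k$

$CZ = i\,j\,k\,\bar{l} \lor i\,j\,\bar{k}\,l \lor i\,\bar{j}\,k\,l \lor \bar{i}\,j\,k\,l$

$DZ = i\,j\,k\,l$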


FIGURE  20   TIME  INDEX  LOGIC  (cont.) 
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hardware  costs,  we  wish  to  keep  the  range  of  possible  indexes  to  a  minimum. 
We then need to do the indexing in a circular manner. In order to keep the
compare logic to a minimum, we will add an additional bit to the minimum
word size needed. We can then do circular indexing by alternately considering
this high-order bit or its complement as the most significant bit of
the index. That is, whenever we use up all indexes and start over, the new
indexes we assign will have this bit set opposite to what it was just before
we started over. Numbers whose high-order bit is the same as that we are
now assigning will all be considered greater than numbers with the opposite
value for this high-order bit. In the case of vector results, a maximum
of 36 may be processed in 9 clocks. Thus a total of 7 bits will be required:
6 bits to distinguish 36 indexes, plus the additional high-order bit.
In both the vector and scalar cases, we only need to time index instructions
with results of the specified type. Generating these indexes
requires logic similar to, but a bit more complex than, that required for
generating the instruction numbers. This logic can operate in parallel with
the IUD front-end logic and thus has three clocks available to it. Once
generated, these indexes will be carried along through the pipe in their own
set of registers. They will be accessible at any time merely by providing
the instruction index as an address to the registers containing the time
indexes. Instructions that do not have a result of the specified
type will receive a time index that is one greater than
that of the most recent instruction with a result of the specified type. This will
be the same index as the next instruction with a result of that type. Figure
20 gives the structure of the time index logic. Table 27 gives a gate
count for the various index generators which may be required.
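The circular indexing scheme described above can be illustrated with a small behavioral model. This is a sketch under assumptions, not the logic of Figure 20: the class name `CircularIndex` and its methods are hypothetical, and the comparison is only meaningful while the outstanding indexes span less than one full cycle of the low-order bits.

```python
class CircularIndex:
    """Circular time indexes with one extra high-order bit (hypothetical model)."""

    def __init__(self, low_bits):
        self.low_bits = low_bits
        self.high = 0    # value of the high-order bit now being assigned
        self.count = 0   # next low-order value to assign

    def assign(self):
        index = (self.high << self.low_bits) | self.count
        self.count += 1
        if self.count == 1 << self.low_bits:
            # All indexes used up: start over with the high-order bit complemented.
            self.count = 0
            self.high ^= 1
        return index

    def _key(self, index):
        high = index >> self.low_bits
        low = index & ((1 << self.low_bits) - 1)
        # Indexes carrying the bit now being assigned rank above the others.
        return (1 if high == self.high else 0, low)

    def is_later(self, x, y):
        """True when index x was assigned after index y."""
        return self._key(x) > self._key(y)


gen = CircularIndex(low_bits=2)          # 2 low-order bits keep the example small
issued = [gen.assign() for _ in range(6)]
print(issued)                            # the last two indexes follow a wraparound
print(gen.is_later(issued[4], issued[3]))
```

With 6 low-order bits plus the extra bit, the same model gives the 7-bit vector time index quoted in the text.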
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TABLE  27   GATE  COUNT  FOR  TIME  INDEX  GENERATORS 


                                      Number    9 Bits                      7 Bits
Unit                                  Units     Gate Count/Unit    Total    Gate Count/Unit    Total

Bit 0-4 (9), Bit 0-2 (7)              1         50                 50       30                 30
Plus 1 through plus 4 adders          4         30                 120      30                 120
Carry or no carry switches            4         12                 36       12                 36
Logic to alter meaning of most        1         10                 10       10                 10
  significant bit

4 Output Counter                      1                            216                         196

Switch Control (1)                    1         64                 64       64                 64
Switch (1)                            1         5*9*3=135          135      5*7*3=115          115
Switch Control (2)                    1         98                 98       98                 98
Switch (2)                            1         14*9*3=378         378      14*7*3=294         294
Registers                             6         36                 216      28                 168

TOTALS                                                             1107                        935
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FROM  VECTOR 
TIME INDEX PIPES

[FIGURE 21   VECTOR COMPARISON TREE (block diagrams in several panels).
Main panel: inputs FROM VECTOR TIME INDEX PIPES and FROM VECTOR OPERAND PIPES; 4 x 36 SWITCH; TREE REGISTERS; 36 PIPE REGISTERS; COMPARISON TREE LOGIC AND 36 x 1 SWITCH; TREE CONTROL, with NUMBER OF RESULTS inputs.
Panel "4 x 36 SWITCH CONTROL LOGIC DETAIL": NEXT FREE COUNTER; 6 BIT 36 OUTPUT ADDRESS DECODER; 1 TO 4 FANOUT (36 TIMES); CONTROL TO APPROPRIATE SWITCH PATHS.
Panel "COMPARISON TREE PURGE LOGIC": input NUMBER OF INSTRUCTIONS WITH VECTOR RESULTS COMPLETELY UPDATED IN VECTOR STATUS TABLE IN THIS CLOCK; EACH LINE CONTAINS 4 CONSECUTIVE INPUTS FROM ADDRESS DECODER; output PURGE SIGNALS.
Panel "VECTOR COMPARISON TREE LOGIC DETAIL": FROM TREE REGISTERS; 36 COMPARE ELEMENTS; 3 LEVEL LAST MATCH SELECTOR; 36 x 1 SWITCH; output TIME INDEX.
Panel "3 LEVEL LAST MATCH SELECTOR SHOWING DETAIL FOR FIRST 16 OF 36 OUTPUTS": LMS = Last Match Selector; PNP = Pass/No Pass Unit (4 elements); PASS/NO PASS UNIT, 16 ELEMENTS; N OUTPUTS FROM OTHER LEVEL 2 LMS; TO OTHER 16 ELEMENT PASS/NO PASS SELECTOR.]
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A  -  D    are  address  decoder  inputs 

X,  Y,  Z    are  numbers  of  complete  inputs 
P      is  output 

P  =  AZ  V  BX  V  c  YZ  V  DX 


FIGURE  21   VECTOR  COMPARISON 
TREE  (cont.) 
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TABLE  28   VECTOR  BUFFER  COMPARISON  TREE  GATE  COUNT 


Unit                        Gate Count

4 x 36 Switch Control       16+2*36+4*36 = 232
Purge Logic                 16+2*36+8*36 = 376
Next Free Counter           100
Next Purge Counter          100
4 x 36 Switch               4*36*20*4 = 11520
36 Registers                4*36*20 = 2880

Comparison Tree Logic

Unit                        Number         Gate Count/Unit    Total

LMS                         9+2+1 = 13     14                 182
4 Input PS                  9              8                  72
16 Input PS                 3              32                 96
36 x 1 Switch (10 bits)     1              36*10*3            1080

COMPARISON TREE TOTAL                                         1436
6 TREES                                                       8616

UNIT TOTAL                                                    23,824
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4.6.3.2.3   Parallel  Update  of  Vector  Buffer  Table  (VPU) 

As mentioned in the previous section, up to 36 vector buffer results
may be in an incomplete state inside the IUD pipe. The search of the
vector buffer table (SVT) must be able to detect this fact. To allow this,
we need a comparison tree similar to that described in Section 4.6.3.2.1.
This tree must be much larger to allow for 36 entries. Since entries can
remain in the tree for up to 9 clocks, we need some additional control logic
to properly purge and update the tree. Its functional operation is identical
to that of the scalar result comparison tree described in Section
4.6.3.2.1. Figure 21 gives a diagram of this tree. Table 28 gives a gate
count for the tree.

Most  of  Figure  21  is  self  explanatory.  The  three-level  last  match 
selector  does  require  some  additional  explanation.  It  is  built  from  the 
four  input  last  match  selectors  whose  logic  equations  occur  in  Figure 
Outputs from the 36 compares are routed into nine of these last match selectors
(LMS). These nine are divided into three groups, two of size 4 and one
containing a single element. The match/no-match output (N) from each element
in a size 4 subgroup is one of the inputs to another LMS. Finally, the
N outputs from these two LMS plus the output from the solo element at level 1
are fed into a final LMS. The N output from this final LMS is the N output
for the entire comparison tree. The four select outputs from each LMS are
used either for control or as input to a pass/no-pass selector (PS). The
level 1 LMS outputs are inputs to nine PS. The control for each of these PS
is a selected output from a level 2 LMS. In particular, if the corresponding
level 1 LMS is the last in its group of four to have a match, then its
PS will be enabled, and otherwise not. The same game is played at level 3,
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but  each  PS  controls  16  inputs.  The  extra  level  1  LMS  is  handled  in  the 
obvious  way  to  minimize  package  costs  without  requiring  specially  designed 
units. 
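The three-level arrangement can be modeled behaviorally as below. This is a sketch under assumptions: the function names are hypothetical, the fourth input of the final LMS is assumed to be tied off as unused, and the pass/no-pass selectors are modeled simply as AND-ing each level 1 select output with the enables from the higher levels.

```python
def lms(bits):
    """Four-input last match selector: (one-hot select list, no_match flag)."""
    select = [False] * 4
    for i in reversed(range(4)):
        if bits[i]:
            select[i] = True
            break
    return select, not any(bits)


def last_match_36(matches):
    """Return (one-hot list of 36 select outputs, no_match) for 36 compare outputs."""
    assert len(matches) == 36
    # Level 1: nine LMS, each looking at four adjacent compare outputs.
    level1 = [lms(matches[4 * g:4 * g + 4]) for g in range(9)]
    hit = [not no_match for _, no_match in level1]
    # Level 2: one LMS per group of four level 1 units; the ninth unit stands alone.
    sel_a, none_a = lms(hit[0:4])
    sel_b, none_b = lms(hit[4:8])
    # Level 3: final LMS over the two groups and the solo unit (fourth input unused).
    sel_top, no_match = lms([not none_a, not none_b, hit[8], False])

    out = []
    for g, (select, _) in enumerate(level1):
        if g < 4:
            enable = sel_a[g] and sel_top[0]
        elif g < 8:
            enable = sel_b[g - 4] and sel_top[1]
        else:
            enable = sel_top[2]
        # Pass/no-pass: a level 1 select output survives only if its unit is enabled.
        out.extend(s and enable for s in select)
    return out, no_match


compares = [False] * 36
for position in (3, 17, 30):
    compares[position] = True
selected, none_found = last_match_36(compares)
print(selected.index(True), none_found)
```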

4.6.3.2.4   Searching  the  Scalar  Table 

If  we  examine  Table  24,  we  see  from  the  dependency  column  that  no 
function  requires  output  from  SST.  We  also  know  from  Section  4.6.2.3  that 
close communication is required between the SEUs and the circuitry maintaining
the scalar status tables. Thus it is both possible and desirable
to  move  the  scalar  status  tables  and  the  SST  hardware  to  be  physically  just 
ahead  of  the  SEUs.  One  additional  advantage  in  doing  this  is  that  it  will 
no  longer  be  required  to  process  these  instructions  at  the  maximum  rate  that 
can  occur  in  the  IUD  pipe,  but  only  at  the  lower  rate  at  which  the  SEUs  can 
accept them. This advantage cannot be realized in the case of vector instructions
because of the manner in which logical vector addresses are
distributed  among  the  VEUs.  Memory  and  scalar  instructions  that  interact 
with  the  vector  unit  must  be  processed  together  in  a  manner  that  logically 
corresponds  to  the  actual  sequence  in  which  the  instructions  occur. 

We  will  move  the  SST  and  SPU  functions  into  the  SIDS  mentioned  in 
Section  4.3.2.1.  We  have  already  described  the  detailed  hardware  for  SPU 
which  was  then  used  in  the  next  section  on  the  VPU.  We  will  describe  in 
detail  the  remainder  of  the  hardware  in  Section  4.6.5  where  we  provide  a 
complete  description  of  the  SIDS. 
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4.6.3.2.5   Searching  the  Vector  Table  (SVT) 

The vector status table is the comparison tree described in Section
4.6.3.2.3 and an associative memory that maps logical buffer addresses to
physical buffer addresses. Since most of the accessing of this table is
by logical addresses, it would be desirable to have each logical vector buffer
address be a physical address to this table. At a given instant in time,
there may exist in the queues several instructions with the same logical
address. However, only the most recent store to a logical vector location
is required for this table. Thus we can make the logical address of a vector
be a physical address to the status table. The only information that
will then be needed in the table is the current physical address of the
logical buffer. When we first discussed this table in Section 4.6.2.4, we
described it as an associative memory. We see here that this is no longer
necessary. In that previous discussion, we mentioned the necessity of providing
use counts for the VEU as a means of noting when a logical register
was available for re-use. For the same reasons discussed in the previous
section, these functions are best transferred to just in front of the vector
portion of the machine. We will describe this unit, the Vector Instruction
Dispatcher Subsystem (VIDS), in Section 4.6.6.

There are two functions that this table must perform. It must allow
for accesses, in parallel, up to the number of vector operand pipes. There
are six of these pipes. Two minor cycles are allowed for this access and for
selecting the output from this table or the comparison tree. The access
must be pipelined so that a set of six is complete in every minor cycle.
The same table must be able to accept stores in parallel at the rate that
the vector results can emerge from the pipe. Again, two clocks are allowed
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TABLE  29   VECTOR  STATUS  TABLE  GATE  COUNT 


Unit                                              Gate Count                     Number Units    Total

2 Level Address Decoder (128)                     16*4+8*3+128*2 = 344           9               3096
2 Level Address Decoder (256)                     2*16*4+256*2 = 656             9               5904
Bits to Store Decoded Address for                 4*128 = 512                    9               4608
  Pipelining (128)
Bits to Store Decoded Address for                 4*256 = 1024                   9               9216
  Pipelining (256)
Memory Location, 16 Bits with 3 Way Fan In        16*(4+3+2+6) = 240             128             30720
  and 6 Way Fan Out                                                              256             61440
Comparison Tree/Memory Selection Logic            (4+16)*2+16*2*3 = 136          1               136

Total (128)   not including memory                7,840
              with memory                         38,560

Total (256)   not including memory                15,256
              with memory                         76,696
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for  this  function,  but  it  must  be  pipelined  to  allow  a  set  of  stores  to  be 
processed  starting  at  every   clock.  To  do  this,  we  require  only  a  standard 
memory,  but  with  six  read  address  decoders  with  their  own  output  lines  and 
three  write  address  decoders  with  their  own  output  lines.  The  two-clock 
pipelining  can  consist  of  one  clock  for  address  decoding  and  one  clock  to 
do  the  read  or  write. 

Table  29  gives  a  gate  count  for  this  memory  and  the  circuitry  that 
selects  either  the  memory  output  or  the  comparison  tree  output. 
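The behavior of this multi-ported memory, together with the selection between its output and the comparison tree, might be modeled as follows. This is only a sketch under assumptions: the class and method names are hypothetical, the comparison tree is represented by a plain dictionary of pending results, and the two-clock pipelining is not modeled.

```python
class VectorStatusTable:
    """Logical-to-physical vector buffer map, addressed directly by logical address."""

    READ_PORTS = 6    # one per vector operand pipe
    WRITE_PORTS = 3   # write address decoders

    def __init__(self, size=128):
        self.physical = [0] * size

    def store(self, writes):
        """Accept up to WRITE_PORTS (logical, physical) stores in one clock."""
        if len(writes) > self.WRITE_PORTS:
            raise ValueError("too many stores for one clock")
        for logical, physical in writes:
            self.physical[logical] = physical

    def lookup(self, logicals, pending):
        """Serve up to READ_PORTS reads in one clock.

        pending maps a logical address to the time index of a result that is
        still incomplete in the pipe (the comparison tree's role). A pending
        entry overrides the memory, since the stored address is not yet valid.
        """
        if len(logicals) > self.READ_PORTS:
            raise ValueError("too many reads for one clock")
        results = []
        for logical in logicals:
            if logical in pending:
                results.append(("pending", pending[logical]))
            else:
                results.append(("physical", self.physical[logical]))
        return results


table = VectorStatusTable()
table.store([(5, 40), (9, 41)])
print(table.lookup([5, 9, 12], pending={9: 3}))
```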

4.6.3.2.6   Selecting  the  Scalar  Execution  Unit  (SSE) 

This is another function which can and should be moved to the SIDS.
However, since this design is a simplified version of the hardware for
selecting the vector execution unit, we will provide a detailed design at
this point. We will take advantage of the lowered processing rate allowed
by transferring this function to the SIDS. Thus we will only need to process
instructions at the rate the SEUs can process them or, at most, one per
minor clock.

A small table containing the current queue size of each SEU is required.
Whenever an instruction has completed execution, a table entry must be decremented.
Whenever an instruction is entered into a queue, a table entry must
be incremented. Instruction codes are logical addresses referring to a group
of functionally identical SEUs. If for a particular logical address there is
only one element in the specified group, then the SSE function is simply to
convert the logical address to a physical address and, if the queue and the
SIDS buffer are full, send a signal to hold up the IUD. If there is more
than one functionally equivalent SEU involved, then the conversion to a
physical address involves selecting the unit with the smallest queue and
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holding  up  the  IUD  if  all  the  queues  and  the  SIDS  buffer  are  full.  Figure 
22  is  a  unit  to  perform  this  function  for  up  to  six  equivalent  SEUs. 
Table  30  provides  a  gate  count  for  this  unit. 

This unit must provide an indication of the smallest queue which is
updated every clock. To allow for this, we have 12 registers. Six of these
contain the current queue sizes. These are divided into two groups of three.
For each of these groups there is an additional set of three registers containing
the differences in queue sizes. These differences are not generated
by doing subtracts, but rather directly from the increment and decrement
signals to the queue sizes. The signs of these differences are used to determine
a minimum in each group of three and to control a 3 x 1 switch to
transfer that minimum to another register. In one minor clock all register
incrementing and decrementing and the transfer of a minimum can be performed.
In the next clock a true subtract can be done on the two minima generated in
the previous clock. The results of this subtract can then be used to choose
from which of the two groups the global minimum is to be chosen. Because
of this two-stage pipelining, we do not necessarily have the absolute minimum
at a given clock. However, only one instruction per queue can complete in a
major clock, and only one queue location can be reserved in a minor clock.
Thus this additional one clock delay will, at most, allow a difference of
two from the minimum, and this cannot happen often.
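The two-stage selection just described can be sketched behaviorally. This is an illustration under assumptions, not the logic of Figure 22: the class names are hypothetical, the pairwise differences are kept as plain integers rather than pseudo-carry counters, and the three sign tests used to pick a group minimum are one workable choice rather than the figure's equations.

```python
class GroupOfThree:
    """Three queue-size registers plus three difference registers."""

    PAIRS = ((0, 1), (1, 2), (0, 2))

    def __init__(self):
        self.size = [0, 0, 0]
        self.diff = {pair: 0 for pair in self.PAIRS}   # size[i] - size[j]

    def update(self, delta):
        """delta[i] is +1, 0 or -1: the net increment/decrement signal for queue i."""
        for i in range(3):
            self.size[i] += delta[i]
        # Differences follow the same signals directly; no subtract of sizes is done.
        for (i, j) in self.PAIRS:
            self.diff[(i, j)] += delta[i] - delta[j]

    def minimum(self):
        """Pick the smallest queue using only the signs of the differences."""
        a_gt_b = self.diff[(0, 1)] > 0
        b_gt_c = self.diff[(1, 2)] > 0
        a_gt_c = self.diff[(0, 2)] > 0
        if not a_gt_b and not a_gt_c:
            return 0
        if a_gt_b and not b_gt_c:
            return 1
        return 2


class QueueSelector:
    """Six SEU queues in two groups; the answer lags the updates by one clock."""

    def __init__(self):
        self.groups = [GroupOfThree(), GroupOfThree()]
        self.latched = [(0, 0), (0, 0)]   # (unit within group, its queue size)

    def clock(self, deltas):
        # Stage 2: true subtract on the two minima latched in the previous clock.
        (unit0, min0), (unit1, min1) = self.latched
        choice = unit0 if min0 - min1 <= 0 else 3 + unit1
        # Stage 1: apply this clock's signals and latch each group's minimum.
        for g, group in enumerate(self.groups):
            group.update(deltas[3 * g:3 * g + 3])
            k = group.minimum()
            self.latched[g] = (k, group.size[k])
        return choice


selector = QueueSelector()
selector.clock([1, 1, 1, 1, 0, 1])       # every queue but queue 4 gains an entry
print(selector.clock([0, 0, 0, 0, 0, 0]))
```

The one-clock lag is visible in the example: the first call still answers from the initial state, and only the second call reflects the update.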
FIGURE 22   SEU QUEUE SELECTOR

LOGIC EQUATIONS FOR INTEGRATOR

Output:
   S = Sign (TRUE means negative)
   X = High order bit
   Y = Low order bit

Input:
   P0, P1 = Increment input
   N0, N1 = Decrement input

   $X = \overline{N_0}\,\overline{N_1}\,P_1 P_0 \lor N_0 N_1\,\overline{P_1}\,\overline{P_0}$

   [Equations for S (four product terms) and Y (eight product terms) in
   N0, N1, P0, P1 are not recoverable from this copy.]

LOGIC EQUATIONS FOR a-b PSEUDO CARRY DIFFERENCE COUNTERS

Input:   S, X, Y just defined, plus
         a0, a1, a2, a3, Sd (sign and 4 bits of differences)
Output:  Sr, r0, r1, r2, r3 (sign and 4 bits of new differences)

Note:    X and Y cannot both be TRUE

   $r_3 = a_3 \overline{Y} \lor \overline{a_3} Y$
   Sa: function of S and Sd (TRUE means addition, FALSE means subtraction)
   $C_P = S_a Y a_2 a_3 \lor S_a X a_2$   (positive carry)
   $C_N = \overline{S_a}\,Y\,\overline{a_2}\,\overline{a_3} \lor \overline{S_a}\,X\,\overline{a_2}$   (negative carry)
   $r_1 = a_1 \overline{C_P}\,\overline{C_N} \lor \overline{a_1} C_P \lor \overline{a_1} C_N$
   $r_0 = \overline{a_0} a_1 C_P \lor \overline{a_0}\,\overline{a_1} C_N \lor a_0 \overline{C_P}\,\overline{C_N} \lor a_0 a_1 C_N \lor a_0 \overline{a_1} C_P$

   [Equations for Sa, r2 (separate add and subtract terms), and Sr
   (addition; subtraction, no overflow; subtraction, overflow) are not
   recoverable from this copy.]

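LOGIC TO SELECT MINIMUM ELEMENT FROM SIGNS OF DIFFERENCES

   a > b  ->  Sab
   b > c  ->  Sbc
   c > a  ->  Sca

   a minimal = $\overline{S_{ab}}\,S_{ca} \lor \overline{S_{ab}}\,\overline{S_{bc}}\,\overline{S_{ca}}$

   b minimal = $S_{ab}\,\overline{S_{bc}}$

   c minimal = $\overline{S_{ca}}\,S_{bc}$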
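
The selection of a minimum from three sign bits can be checked exhaustively with a short sketch. This is only an illustration under stated assumptions: the three sign bits are taken to be the outcomes of the comparisons a > b, b > c, and c > a, the complement bars (lost in reproduction) are placed where the logic requires them, and the function name `select_minimum` is hypothetical.

```python
def select_minimum(a, b, c):
    """Return (a_min, b_min, c_min) flags computed only from three sign bits."""
    s_ab = a > b  # assumed meaning of the first sign bit
    s_bc = b > c  # assumed meaning of the second sign bit
    s_ca = c > a  # assumed meaning of the third sign bit
    # Assumed placement of complements; exactly one flag is raised.
    a_min = (not s_ab and s_ca) or (not s_ab and not s_bc and not s_ca)
    b_min = s_ab and not s_bc
    c_min = (not s_ca) and s_bc
    return a_min, b_min, c_min


def chosen_value(a, b, c):
    """Value of the element that the flags select."""
    flags = select_minimum(a, b, c)
    return [v for v, f in zip((a, b, c), flags) if f][0]
```

Under these assumptions exactly one of the three outputs is true for every input, and ties are broken in a fixed order, which is what a hardware selector needs.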


TABLE 30   SEU QUEUE SELECTOR GATE COUNT

Unit                                  Gate Count   Number Units   Total

Integrator                                    54              6     324
Queue Counters Ripple Carry 4 Bits            16              6      96
Difference Counters                          140              6     840
3 x 1 Switch (4 bits)                         36              2      72
Adder for M0 - M1                             24              1      24
Registers for M0, M1                          16              2      32
Registers for Sign Bits                       12              2      24
Pass/No-Pass Units                            15              2      30

TOTAL                                                              1442
TABLE 31   CONNECTIONS FROM OPERAND PIPES TO
VEU QUEUE SELECTOR

Operand Pipe     Queue Selector Port

     0           0
     1           1,2
     2           2,3,4
     3           3,4,5,6
     4           4,5,6,7
     5           5,6,7
     6           6,7
     7           7

The two-bit instruction index forms the high-order bits of the port address.
The bit indicating the first or second operand of binary instructions is the
low-order bit.

Gate Count

Unit                      Gate Count   Number Units   Total

3 bit address decoder             32              6     192
2x1 4 bit switch                  24              2      48
3x1 switch                        32              2      64
4x1 switch                        40              2      80
Direct path                        8              2      16

TOTAL                                                   400
All operands of the same type are adjacent. Thus we need only look at the
type of adjacent instruction bytes. Let A, B, C represent types of adjacent
instruction bytes. True = type we are testing for. x true indicates
the byte corresponding to B is the second operand of a binary instruction.

   x = AB

Detecting a partial instruction is only slightly more complex. The question
is, "Does a terminal byte exist before the last byte?" T0-T11 indicate
whether the corresponding byte is terminal. P_i is true if byte i is part of
a partial instruction.

   $P_i = \bigwedge_{j=i}^{12} \overline{T_j}$

FIGURE 23   LOGIC TO INDEX OPERANDS AND
DETECT A PARTIAL INSTRUCTION
TABLE 32   GATE COUNTS FOR INDEXING OPERANDS
AND PARTIAL INSTRUCTION DETECTION

Indexing Operands

Operation                  Gate Count   Number Units   Total

Operand type detection              5           12*2     120
Index generation                    3           12*2      72

TOTAL                                                    192

Detecting Partial Instructions

   Number of Gates = $\sum_{i=2}^{12} (i+1)$ = 76
4.6.3.2.7   VEU Queue Selector (SVE)

There are two important differences between VEU and SEU queue selectors.
First, this function cannot be moved outside the main IUD pipe, and thus we
must allow for the processing of up to four instructions in parallel. The
second difference is the added complexity that results from VEUs being
assigned to individual programs as discussed in Section 4.6.2.6. This requires
that the operands of a vector instruction be combined with the operator in
the unit which selects the VEU. We have allowed a two minor clock delay for
this processing, but with a fully pipelined processing rate of four
instructions per clock. We will first discuss how the operands and operation
get together and then the details of the queue selection hardware.

Since each instruction element has an instruction index associated with
it, switching the operands to the queue selection elements is relatively
straightforward. We require a switch from the 8 vector operand pipes to the
4 sets of input ports for selecting a VEU. This does not require a full 8x8
crossbar. Table 31 lists the connections required and gives a gate count for
this unit. One problem occurs with binary instructions. It is necessary that
each operand be switched to a different entry in the set being used for a
particular instruction. To allow for this, it would be desirable to have
associated with each vector operand a single bit indicating if this is the
first or second operand in a binary instruction. The same information would
be desirable for scalar instructions when they are recombined to be routed to
the SIDS. This information is recoverable from the index of the pipe in which
the operand occurs, but having it instantly available is necessary to maintain
the high processing rate. Figure 23 gives the logic for this process, and
Table 32 provides a gate count. This unit will occur in the pipe front-end
where the instruction index generator occurs (see Section 4.6.3.1.4).
A second problem results from the fact that only part of an instruction
may correspond to the first or last instruction index. At most four
instructions can complete in a given clock. The cases where four full
instructions enter the pipe in a single clock allow space for no partially
complete instructions. If we provide additional registers to hold any initial
segment of an instruction with the highest queue index, we can then complete
processing of that instruction in the next clock. This is the first point in
the pipe where this problem arises. Thus we can add switches and buffers to
hold a partially completed instruction. In order to minimize the circuitry to
do this, we will provide one bit associated with each instruction component to
indicate if it is part of a partial instruction. The logic for this is
included in Figure 23 and the gate count for this logic is in Table 32.
Figure 24, which we will discuss shortly, gives the switch control and
buffers for retaining partial instructions.

Now that we have designed hardware to retain and collect the information
necessary for VEU queue selection, we can proceed to design the hardware to
perform the algorithms described in Section 4.6.2.6. There are two ways in
which this unit is more complex than the SEU queue selection unit. First,
it is not simply the queue size that is relevant to selecting a VEU, but also
the number of operands already resident in the queue and whether the VEU has
been assigned to the program. These complications are handled by having six
effective queue sizes for each VEU. There are effective queue sizes for the
following cases:

A.  Cases  where  the  VEU  is  assigned  to  this  program. 

1.  No  operands  for  this  instruction  in  this  queue. 

2.  One  operand  for  this  instruction  in  this  queue. 

3.  Two  operands  for  this  instruction  in  this  queue. 
B.  Cases where the VEU is not assigned to this program.

1.  No  operands  for  this  instruction  in  this  queue. 

2.  One  operand  for  this  instruction  in  this  queue. 

3.  Two  operands  for  this  instruction  in  this  queue. 

Figure  24  shows  the  entire  design  of  the  VEU  queue  selector.  The  four  groups 
of  six  registers  in  this  figure  contain  the  above  effective  weights  for  each 
of  the  four  VEU  queues. 

The second factor which makes this unit more complex than the SEU queue
selector is the necessity of processing up to four instructions in parallel.
In Table 24 we have allowed two clocks for this processing. The problem with
meeting these time constraints results from the way the queue use of one
instruction can affect queue assignment for later instructions. Since we are
processing up to four instructions in parallel, we need to somehow
simultaneously take into account the queue use interactions of four
instructions. Although we are allowed two minor clocks to do the processing,
we must pipeline this with an emergence rate of four instructions every minor
clock. This will have the consequence of having to start processing of a given
set of four instructions before the queue weight registers have been updated
for the previous two sets of four instructions. Table 33 summarizes the
dependency relationships of the instructions.

We now outline the algorithms employed in meeting the above constraints.
Each of the three major functions we will list is performed in parallel for
different sets of four instructions. The subfunctions are performed
sequentially on the same group of four instructions.

I.  Update queue weight registers.

1.  Use  a  bit  serial  counter  to  decrement  the  weight  register 
by  1  if  the  corresponding  VEU  has  notified  the  IUD  that  a 
queued  instruction  has  started  to  execute. 

2.  Cascaded  with  the  bit  serial  counter,  use  a  bit  serial  adder 
to  increment  the  corresponding  weight  register  by  the  queue 
use  of  the  instructions  which  had  their  queue  reservation 
processing  completed  in  the  previous  minor  clock. 

II.  First  minor  clock  of  queue  reservation  processing. 

1.  Switch  instructions  and  their  operands  into  Unit. 

2.  Determine  which  weights  to  use. 

3.  Select  the  determined  weights. 

4.  Increment  each  weight  by  the  corresponding  queue  usage  of 
the  queue  reservations  made  in  the  previous  clock. 

5.  Determine  the  minimum  weight.  This  and  the  previous  function 
are  done  simultaneously. 

6.  Subtract the minimum weight from all weights.

III.  Second  minor  clock  of  queue  reservation  processing. 

1.  Increment  each  weight  by  the  corresponding  queue  usage  of  the 
queue  reservations  completed  in  the  previous  clock.  Simul- 
taneously subtract  0,  1,  2,  3,  and  4  from  each  of  these  sums. 

2.  Select  the  set  of  weights  that  produced  a  zero  sum  and  had 
the  smallest  value  subtracted  from  it. 

3.  Decode the selected weights as follows:

a.  For instruction 0 generate B0_n, which is true if the
weight for instruction 0 queue n is 0.

b.  For instruction 1 generate B1_n similar to B0_n.

c.  For instruction 2 generate B20_n and B21_n. B20_n is
similar to B0_n. B21_n is true if the weight for
instruction 2 queue n is 1.

d.  For instruction 3 generate B30_n, B31_n, B32_n.

4.  Simultaneously select the queues for instructions 0 and 1
according to the following algorithms:

a.  For instruction 0 select the minimum n such that B0_n.

b.  For instruction 1 select the minimum n such that B1_n
is true and queue n was not selected for instruction 0.
If this is not possible, select the unique n for which
B1_n.

5.  Select  the  queue  for  instruction  2,  taking  into  account  the 
queues  selected  for  instructions  0  and  1. 

6.  Select  the  queue  for  instruction  3,  taking  into  account  the 
queue  selections  for  the  previous  three  instructions. 

7.  Decode  the  queues  selected  into: 

a.  Binary  integers,  giving  the  queue  use  for  each  queue. 

b.  Queue  addresses  for  the  instructions. 
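
The flow of steps II-4 through II-6 and III-3 through III-4 can be sketched in a few lines of Python. This is a behavioral illustration only, not the logic of Appendix A: the function names `normalize` and `select_first_two` are hypothetical, weights are plain integers, and it is assumed that each instruction's weight list contains at least one zero after normalization.

```python
def normalize(weights, usage):
    """Steps II-4 to II-6: add pending queue usage, find the minimum, subtract it."""
    bumped = [w + u for w, u in zip(weights, usage)]
    low = min(bumped)
    return [w - low for w in bumped]


def select_first_two(w0, w1):
    """Steps III-3a, III-3b and III-4: choose queues for instructions 0 and 1."""
    b0 = [w == 0 for w in w0]  # B0_n: weight for instruction 0, queue n, is 0
    b1 = [w == 0 for w in w1]  # B1_n: same test for instruction 1
    q0 = b0.index(True)        # minimum n such that B0_n
    # Minimum n with B1_n that instruction 0 did not take ...
    free = [n for n, hit in enumerate(b1) if hit and n != q0]
    # ... otherwise the only n for which B1_n holds.
    q1 = free[0] if free else b1.index(True)
    return q0, q1
```

For example, `normalize([3, 5, 4, 3], [1, 0, 0, 2])` yields `[0, 1, 0, 1]`, and two instructions that both see a zero weight only in queue 0 are both sent to queue 0. The hardware performs these selections in parallel; the sketch is sequential purely for readability.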

The detailed logic for performing these functions is described in
Appendix A. Most of the logic equations in this appendix are fairly
straightforward. However, in order to meet the serious time constraints, the
logic to perform functions III-4 through III-6 above requires a somewhat
complex technique for constructing the logic equations. We will now describe
this technique.

The method may be thought of as a generalization of the trick used in
constructing a pseudo carry adder. In general, we are constructing a
function R_n from many Boolean inputs. We divide the function into two cases.
We then construct factors D and $\overline{D}$ which will detect these cases.
We also construct Boolean selection functions S1_n and S2_n which will be true
for the correct value of n in cases 1 and 2 respectively. Thus we will get an
equation for R_n as:

   $R_n = D \, S1_n \lor \overline{D} \, S2_n$

This is essentially the way a pseudo carry adder is constructed. Just as
one can generate a multi-level pseudo carry adder, we can generalize our
technique to many levels. Doing this for the adder results in very symmetric
equations. In our case, that symmetry is not present, and a major part of
the design problem is providing a notation to keep track of the terms we have
generated and the cases we have considered. Thus, in the appendix, we have
indexed detection and selection terms as Di.j.k and Si.j.k where i, j, and
k are integers equal to 1 or 2. Each integer separated by a dot represents
another level of cases, and there is no precise limit to how many levels
are allowed other than the necessity of keeping the equations to a reasonable
size. We do not necessarily go to deeper levels in a symmetric way. For
example, we might develop S2.1.1 down three more levels until we are
considering S2.1.1.i.j.k, whereas S2.1.2 may not be developed to any deeper
level. This notation does make it fairly easy to construct such complex
functions. We can always determine the cases we have not yet considered by
simply reading an index backwards until we reach the first 1. The negation
of that case is the next one to be considered. One other technique we employ
to keep the equations from becoming too large is to construct a subcase in a
lower level of logic and to simply use the output from this lower level in
the final equation. When we do this, we replace the corresponding "." with
a "-". Thus, we might construct an S2.1.1_n to use in the equation for S2_n.
TABLE 33   TIMINGS FOR UPDATING WEIGHT SELECTION REGISTERS

Minor    Instruction     Instruction     Instruction     Instruction
Clock    Group 0         Group 1         Group 2         Group 3
         Instruction     Instruction     Instruction     Instruction
         0  1  2  3      0  1  2  3      0  1  2  3      0  1  2  3

  0      E  E  E  E
  1      D1 D1 D1 D1     E  E  E  E
  2      D2 D2 D2 D2     D1 D1 D1 D1     E  E  E  E
  3      R  R  R  R      D2 D2 D2 D2     D1 D1 D1 D1     E  E  E  E
  4                      R  R  R  R      D2 D2 D2 D2     D1 D1 D1 D1
  5                                      R  R  R  R      D2 D2 D2 D2
  6                                                      R  R  R  R

E = Enter Unit; D1, D2 = Determine Queue Use; R = Reset Weight Registers

Instruction,          Requires information about the following
Instruction Group     instructions with weight registers not yet updated

(2,0)                 (1,0) (1,1) (1,2) (1,3)
                      (0,0) (0,1) (0,2) (0,3)
(2,1)                 Same as (2,0) plus (2,0)
(2,2)                 Same as (2,1) plus (2,1)
(2,3)                 Same as (2,2) plus (2,2)
(3,0)                 (2,0) (2,1) (2,2) (2,3)
                      (1,0) (1,1) (1,2) (1,3)
(3,1)                 Same as (3,0) plus (3,0)
(3,2)                 Same as (3,1) plus (3,1)
(3,3)                 Same as (3,2) plus (3,2)
TABLE 34   TIMING AND GATE COUNT FOR VEU QUEUE SELECTION

TIMING FOR FIRST CLOCK OF PIPELINE

Logic Levels:  1 through 11

Functions Completed (in order):
   Information switched into unit
   Weights selected
   Weights switched
   Minimum of weights with U_j added found
   Minimum selected
   Minimum subtracted from all weights

TIMING FOR SECOND CLOCK OF PIPELINE

Logic Levels:  1 through 9

Functions Completed (in order):
   U_j added, constant weights subtracted
   Group with first overflow switched
   Queue use determined
   Queue use decoded

GATE COUNT

Unit                                 Number Gates

Queue Weight Registers                        120
Queue Weight Adders and Counters            2,760
Weight Selection Logic                        456
Weight Switches                               240
Increment and Decode Logic                  6,864
Second Increment                            9,608
Final Selection                               895
Decoding                                       64

TOTAL                                      21,007
In turn, S2.1.1_n could have terms such as D2.1.1-i.j.k S2.1.1-i.j.k_n
in it. We sometimes use a similar notation when values computed at a lower
level of logic are required but do not exactly fit our detection selection
scheme. In such cases, the values are usually defined, e.g., T2.1.1_n, and
are simply used in the equation for S2.1.1_n.

This  notational  scheme  does  seem  to  be  an  effective  tool  in  generating 
complex  multi-level  Boolean  functions  where  it  is  important  to  keep  the 
number  of  levels  small.  Unfortunately,  the  scheme  gives  no  algorithms  for 
determining  what  cases  are  likely  to  be  good  ways  to  break  up  the  function. 
Intuition  and  trial  and  error  are  required  for  that  part  of  the  process. 
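
A one-level instance of the detect-and-select form, a result chosen as (D and S1) or (not D and S2), can be illustrated with an adder. The sketch below assumes that "pseudo carry adder" refers to what is now usually called a carry-select adder; that reading, and all function names, are assumptions of this example rather than anything taken from Appendix A. Bit lists are little-endian.

```python
def to_bits(value, width):
    """Little-endian list of `width` bits of a non-negative integer."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << i for i, bit in enumerate(bits))


def add_bits(x, y, carry_in):
    """Ripple-add two equal-length bit lists; returns (sum bits, carry out)."""
    out, carry = [], carry_in
    for a, b in zip(x, y):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out, carry


def select(d, s1, s2):
    """R_n = D S1_n v (not D) S2_n, applied to every position n."""
    return [(d & p) | ((1 - d) & q) for p, q in zip(s1, s2)]


def carry_select_add(x, y):
    """Add two bit lists, precomputing the upper half for both carry cases."""
    half = len(x) // 2
    low, d = add_bits(x[:half], y[:half], 0)    # D: carry out of the lower half
    s1, c1 = add_bits(x[half:], y[half:], 1)    # case 1: carry into upper half
    s2, c2 = add_bits(x[half:], y[half:], 0)    # case 2: no carry
    carry_out = (d & c1) | ((1 - d) & c2)
    return low + select(d, s1, s2), carry_out
```

Both candidate upper halves are formed at the same time as the lower half, so the only logic after D becomes known is the final selection. Applying the same split again inside `s1` and `s2` gives the multi-level form that the Di.j.k and Si.j.k indices keep track of.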

Table 34 gives the overall timing and gate count of the unit. These
figures are derived from Appendix A. These figures are not unreasonably
large, but they probably could be reduced substantially by some more playing
with the design. The 11 levels of logic or 22 gate delays in one minor
clock is probably the one figure that one would most want to reduce.


4.6.3.2.8   Reserve  Vector  Buffer  Storage  (RVS) 

This unit must reserve storage for the operands and results of each
vector instruction. In the case of operands, the space must be within the
VEU in which the instruction is to be executed. The same is true of results
except for results from a memory load instruction which have space in the
vector buffer. The operand and result portions of this unit operate on
disjoint storage spaces and are functionally independent. We will now describe
the structure and operation of these two units.

Figure 25 gives the overall structure of the result processing portion
of this unit. From this point on we will not provide detailed logical design
unless the unit is in some major way dissimilar from units already designed.
We will make rough conservative estimates of gate counts and logic levels
required. In some instances limited design of parts of a unit may be
required to verify that the estimates are conservative. We will try to be
explicit about the assumptions on which the estimates are based. Thus we
will describe how this unit works and, within this commentary, provide
estimates of gate counts and timing. This same approach will be used
throughout the remainder of Chapter 4.

The first step in assigning storage for vector results is to determine
how many spaces are required in each VEU and in the Vector Buffer. This
function is performed by the field decoder. We are processing up to four
instructions in parallel, so up to four results may be required in a given
VEU. The field decoder generates a count of the number of results required
of each VEU by looking at the VEU address portion of the result field for
each vector instruction. Decoding the address fields requires one gate for
each bit in each field or a total of 12 bits for each VEU. Thus the total will
be under 100 gates. One gate delay, or 1/2 level of logic, is required to do
this decoding. The encoding of the counts for individual VEUs requires roughly
10 gates for each bit of the encoded results or under 300 gates total. One
level of logic is required for this encoding. The counts produced are sent
to the buffer status units and to the final switch.
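
The field decoder's behavior (decode each VEU address, then count the matches per VEU) can be sketched as follows. This is a functional illustration only; the name `decode_result_counts` and the default of six VEUs are assumptions of the example, and each entry of `veu_fields` stands for the VEU address portion of one result field.

```python
def decode_result_counts(veu_fields, num_veus=6):
    """Count how many of up to four result fields name each VEU."""
    if len(veu_fields) > 4:
        raise ValueError("at most four instructions are processed in parallel")
    # Decode stage: one match line per (VEU, instruction) pair.
    match = [[1 if field == veu else 0 for field in veu_fields]
             for veu in range(num_veus)]
    # Encode stage: turn each row of match lines into a small binary count.
    return [sum(row) for row in match]
```

For instance, `decode_result_counts([2, 2, 5, 0])` gives `[1, 0, 2, 0, 0, 1]`: VEU 2 must supply two result locations in this clock, VEUs 0 and 5 one each.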

The vector buffer status unit differs from the VEU status units mainly
in the size of the memory it is working with. The size of the VEU result
memory is likely to be about 16 as discussed in Section 4.4.2.1. The size
of the Vector Buffer is likely to be about 256 as discussed in Section 4.4.3.
The other differences between these units are how they free locations and the
possibility of buffer overflow. If the Vector Buffer ever overflows, this
is a hardware or programming error as discussed in Section 3.2.1.2.1.
The buffer status units try to keep their respective buffers from
becoming full by outputting a "too full" signal when their size crosses a
certain threshold. As discussed in Section 3.2.1.2.1, this threshold should
be variable so that an optimal value can be determined by experience. It
might also be varied with different programs or program mixes. We will
first discuss the VEU status units.
The functions of these units are:

1.  To  maintain  in  the  buffer  registers  four  available  buffer 
locations. 

2.  To signal to the VIDS when a buffer has exceeded its threshold
size.

3.  To  signal  to  the  entire  IUD  to  pause  when  a  request  is  made 
for  space  that  cannot  be  honored. 

4.  To  process  signals  that  indicate  a  given  VEU  buffer  location 
is  available  for  reuse. 

Since there are only 16 locations to be accounted for, it is reasonable
to maintain these in a stack of registers that can be shifted four positions
in a single minor clock. The logic for such registers should be less than
the total number of bits times 12 or less than 1000. The logic to keep
track of the size of the stack, to control the shifting, and to interrupt
the IUD should be under 200 gates. If we allow a new entry to be made at
every fourth position, the logic for the necessary switch and control should
be under 200 gates. It is reasonable to allow at most one location to be
freed in a minor clock and to buffer requests at their initiation whenever
they are generated at a faster rate. Thus the total gate count for one of
these units will be under 1400.
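
The four functions of a VEU status unit can be modeled behaviorally as a small free-location stack. The sketch below is an illustration under assumptions: the class name `VeuStatusUnit`, the method names, and the threshold value of 12 are hypothetical, and a return value of `None` stands in for the pause signal to the IUD.

```python
class VeuStatusUnit:
    """Behavioral model of the free-location stack for one VEU result memory."""

    def __init__(self, size=16, threshold=12):
        self.size = size
        self.threshold = threshold      # assumed adjustable "too full" level
        self.free = list(range(size))   # stack of available locations

    def in_use(self):
        return self.size - len(self.free)

    def too_full(self):
        """Signal to the VIDS that the buffer has exceeded its threshold."""
        return self.in_use() > self.threshold

    def reserve(self, count):
        """Hand out up to four locations in one minor clock, or None to pause."""
        if not 0 <= count <= 4:
            raise ValueError("between 0 and 4 locations per minor clock")
        if count > len(self.free):
            return None                 # request cannot be honored
        taken, self.free = self.free[:count], self.free[count:]
        return taken

    def release(self, location):
        """Process a signal that a location is available for reuse."""
        self.free.append(location)
```

Keeping the available locations at the front of the stack is what lets the hardware present four of them at once and shift by up to four positions per minor clock.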
Figure 26 gives the detailed structure of the vector buffer status
unit. A stack of registers or push-up buffer is also employed in this unit.
This stack only contains 12 of the possible 256 available locations. A
single status bit is maintained for each physical location, and a register
is available if its status bit is one or its address is contained in the
push-up buffer. The status bits are grouped into four portions of 64 each,
and these are searched and set separately. The purpose of the push-up
buffer is to allow for no pauses in instruction processing in the case when
one or more of the groups of 64 may have no free locations. Experiments
might be desirable to obtain an ideal size for this push-up buffer.
However, given that we are assuming less than half the instructions are vector
instructions and given that this algorithm for assigning locations will tend
to distribute locations uniformly, 12 is likely to be an adequate size. The
size of this buffer should also be less than 1000 gates.

The clear tree's function is to decode a 6 bit address into a signal
to set a status bit. This can be done in under 160 gates. The search tree
must output a 6 bit address corresponding to some bit that is set and at the
same time reset that bit. This can be done by having one set of logic that
is pyramided up from the status bits that indicates if a given set of four
and then 16 status bits has a one in it. Logic pyramided down to the status
bits can choose the lowest set of four at each level and simultaneously
decode two bits of the address of the bit that will be ultimately chosen and
send a signal to the correct one of the four groups it is looking at to
choose a bit from that group. Only the decoding of the least significant
two bits of the address is done at the base of the pyramid. This requires
less than 400 gates. There are 256 status bits. At four gates each, this
comes to 1024. The free decoder decodes the first two bits of a free
address and switches the remainder to the appropriate clear tree. This
requires less than 60 gates. The control for the push-up buffer requires
less than 200 gates. The total gate count for the vector buffer status
unit is under 4600.
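
The search tree for one group of 64 status bits can be mimicked in software to show how two address bits are fixed at each level of the pyramid. This is a sequential sketch of logic that is combinational in hardware; the function name `find_and_clear` and the list representation of the status bits are assumptions of the example.

```python
def find_and_clear(status):
    """Find the lowest set bit among 64 status bits, reset it, return its address.

    `status` is a list of 64 integers (1 = location available).
    Returns None when no bit is set.
    """
    # Upward pass: does each set of four, then each set of 16, hold a one?
    quads = [any(status[i:i + 4]) for i in range(0, 64, 4)]    # 16 flags
    blocks = [any(quads[i:i + 4]) for i in range(0, 16, 4)]    # 4 flags
    if not any(blocks):
        return None
    # Downward pass: choose the lowest candidate at each level,
    # fixing two bits of the 6 bit address per level.
    top = blocks.index(True)                           # address bits 5-4
    mid = quads[4 * top:4 * top + 4].index(True)       # address bits 3-2
    base = 16 * top + 4 * mid
    low = status[base:base + 4].index(1)               # address bits 1-0
    address = base + low
    status[address] = 0                                # reset the chosen bit
    return address
```

With bits 37 and 50 set, successive calls return 37, then 50, then `None`. Four such trees, one per group of 64, would be searched separately as the text describes.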

Returning  to  Figure  25,  we  still  need  to  provide  a  gate  count  for  the 
buffer  registers  and  the  final  switch.  The  buffer  registers  require  720 
gates,  and  the  final  switch  and  its  control  require  less  than  400  gates. 
Thus  the  total  for  the  entire  unit  will  be  less  than  14,500. 

4.6.3.2.9   Update  Vector  Tables  and  Fill  in  Vector  Operands  (UV  and  FV0) 

Now that we have determined a physical address for all vector results,
we need to update the vector status table discussed in Section 4.6.3.2.5.
In addition we need to fill in the vector operands which were not known at
that stage in the pipe. No additional logic is required for the first of
these functions since we provided sufficient address decoders when we
originally discussed the vector status table. For the missing operands we
have only a time index of the instruction which generates the result. We
need to construct a table which allows us to map this time index into a
physical address. A simple way to do this is to provide a buffer with one
physical location for each possible time index of an originally undefined
operand. Then, if we load and address this buffer in a circular fashion
and have six independent ports to address it, we will have the problem
solved. This unit will be similar to the non-comparison tree portion of
the vector buffer comparison tree designed in Section 4.6.3.2.3. We will
use the gate counts of that unit. We do not require the 4 x 36 switch listed
there, and we can get by with four 1 x 9 switches, thereby dropping the
gate count for the switch to 720. The six address decoders require less
than 2000 gates. This gives a total of less than 6500 gates.
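
The circular table that maps a time index to a physical address can be sketched as below. The class name `TimeIndexTable`, its method names, and the slot count used in the example are hypothetical; the limit of six lookups per call stands in for the six independent ports.

```python
class TimeIndexTable:
    """Circular buffer mapping an instruction's time index to a physical address."""

    def __init__(self, slots, ports=6):
        self.slots = [None] * slots   # one location per possible time index
        self.ports = ports            # independent read ports per clock

    def record(self, time_index, physical_address):
        """Store the address assigned to the result of instruction `time_index`."""
        self.slots[time_index % len(self.slots)] = physical_address

    def lookup(self, time_indexes):
        """Resolve up to `ports` missing operands in one clock."""
        if len(time_indexes) > self.ports:
            raise ValueError("more lookups than ports")
        return [self.slots[t % len(self.slots)] for t in time_indexes]
```

Because the buffer is addressed circularly, an entry is overwritten once the time index wraps around, so the number of slots must cover every time index that can still be referenced by an undefined operand.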

4.6.4   Tail  End  of  Main  IUD  Pipe 

The main IUD pipe consists of those functions listed in Figure 18. In
the course of designing the pipe we have decided to move functions related
to scalar operands and results to the SIDS. We now must complete the IUD
pipe by assembling bytes into complete instructions and shipping these
instructions to the VIDS, SIDS, or main memory for further processing and
ultimate execution. In addition, after the instructions are assembled, any
instruction that requires the vector switch must have queue entries
generated.

4.6.4.1   Assembling  Instructions  (AV,  AS,  AM) 

Referring again to Figure 18, we see that all the pipes, except the
operator pipes, are separated into vector, scalar, and memory instruction
bytes. Thus the operators must first enter a 3 x 1 switch as they emerge.
From this switch, they enter a buffer for either vector operators, scalar
operators, or memory operators. Simultaneously with making entries in each
of these buffers, we will set up a word of presence bits to be used in
removing entries from these buffers. A portion of this hardware is illustrated
in Figure 27. The non-operator pipes do not require the initial switch.
They do require a buffer and set of presence bits for the process of
assembling complete instructions. After these buffers another set of switches
is required to merge the bytes of an instruction into a complete instruction.
The size of the buffers and switches is determined by the various data rates
involved. We will not consider the question of what constitutes optimal
size, but will suggest some reasonable sizes. Both vector and scalar
execution units can process instructions at the rate of one per minor clock.
Since there will be six of each of these units, the overall emergence rate
will average to slightly less than one instruction per minor clock.
Assuming two memory instructions for each vector instruction is probably
conservative. To relate these figures to the parameters we are determining,
we need to consider instruction sizes and emergence rates from the pipe. The
constraints on instruction sizes are listed in Table 18.

There  will  be  two  levels  of  buffering  involved.  A  sparse  buffer  will 
collect  the  output  as  it  emerges  from  the  IUD.  From  here  the  output  is 
transmitted  to  a  dense  instruction  buffer  from  which  it  will  be  transmitted 
to  its  final  destination.  In  this  section  we  are  determining  the  size  of 
the  first  buffer  and  the  size  of  the  intervening  switch.  The  input  width  of 
this  switch  determines  the  rate  at  which  the  sparse  buffer  is  emptied.  The 
output  width  of  the  switch  determines  the  rate  at  which  complete  instructions 
can  be  assembled.  There  are  three  of  these  switches  operating  in  parallel 
for  each  type  of  instruction.  The  three  switches  are  for  operators,  operands, 
and  results.  It  is  these  parallel  switches  which  reassemble  the  instructions. 
The  output  width  of  these  switches  must  at  least  accommodate  the  average 
instruction  processing  rate.  We  will  use  widths  slightly  larger  than  this 
as  listed  in  Table  35.  The  input  widths  must  be  adequate  to  accommodate  the 
output  widths.  This  can  be  determined  by  consulting  Table  19.  The  size  of 
the sparse vector buffer should be large enough to accommodate uneven
instruction distribution without stopping the IUD. Determining an optimal
value for this size is probably an impossibility. A smart compiler could
probably do quite well with very little buffering by distributing
instructions. A size between 4 and 8 words long would probably be reasonable.

TABLE 35   LOGIC SUMMARY FOR ASSEMBLING INSTRUCTIONS

Unit                                    Size   Number   Gate Count

Switches for Operators                  3x1    10          2400
Sparse Operator Buffers                 10x6    3         14400
Switch and Buffer Input Controls                3          3720
Vector and Scalar Operand Buffers       6x6     2          5760
Buffer Input Controls                           2           408
Vector and Scalar Result Buffers        4x5     2          3840
Buffer Input Controls                           2           272
Memory Operand and Result Buffers       7x6     2          6720
Buffer Input Controls                           2           476
Vector and Scalar Operator Switches     4x6     2          1440
Switch Controls                                 2           240
Memory Operator Switch                  6x8     1          1920
Switch Control                                  1           160
Vector and Scalar Operand Switches*     4x6     2          1440
Switch Controls                                 2           240
Memory Operand Switches*                6x8     1          1920
Switch Control                                  1           160
Vector and Scalar Result Switches*      2x6     2           720
Switch Controls                                 2           240
Memory Result Switch*                   3x8     1           960
Switch Control                                  1           160

TOTAL                                                    47,595

*Attribute refers to instruction type, not operand or result type.

We have assumed 20 bit words and data paths, 4 gates/bit storage, and
2 gates/switch bit junction.
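
As an illustration of how the stated assumptions translate into gate counts, the sketch below recomputes a few of the buffer and switch entries of Table 35. It assumes that a size of m x n denotes m*n words for a buffer and m*n junctions for a switch; the function names are ours and purely illustrative. Only rows that follow this simple product are shown; control logic and the remaining rows are not modeled.

```python
WORD_BITS = 20               # word and data path width assumed in Table 35
GATES_PER_STORAGE_BIT = 4    # storage cost assumed in Table 35
GATES_PER_JUNCTION_BIT = 2   # switch cost assumed in Table 35


def buffer_gates(rows: int, cols: int, count: int = 1) -> int:
    """Gates for `count` buffers, each holding rows * cols words (assumed reading of 'm x n')."""
    return rows * cols * WORD_BITS * GATES_PER_STORAGE_BIT * count


def switch_gates(inputs: int, outputs: int, count: int = 1) -> int:
    """Gates for `count` switches, each with inputs * outputs junctions (assumed reading of 'm x n')."""
    return inputs * outputs * WORD_BITS * GATES_PER_JUNCTION_BIT * count


if __name__ == "__main__":
    print("sparse operator buffers:", buffer_gates(10, 6, count=3))
    print("memory operand and result buffers:", buffer_gates(7, 6, count=2))
    print("memory operator switch:", switch_gates(6, 8))
```

Under these assumptions the three printed values agree with the corresponding gate counts listed in the table.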

Table 35 lists actual choices for all the design parameters required and
provides approximate gate counts. We do not design the switch
controls.  They  are  driven  by  instruction  indices  and  instruction  types 
carried  along  with  the  instructions.  Techniques  used  in  previous  sections 
should  easily  produce  units  within  the  specified  gate  estimates  and  timing 
constraints. 

4.6.4.2  Initiating  Transfer  of  Instructions  (IM,  ISC,  IV) 

The next function is to ship the assembled instructions to the appropriate
units. One of these destinations will be a unit for generating
vector switch instructions. We need to estimate buffer sizes and data path
widths using the methods and estimates of the previous section. These
estimates and gate counts are summarized in Table 36.

4.6.4.3  Generating  Vector  Switch  Instructions  (GSI) 

Either  memory  or  vector  instructions  may  require  use  of  the  vector 
switch.  We  must  scan  these  instructions  for  vector  operands  and  results 
and,  where  present,  generate  the  appropriate  queue  entries  for  the  vector 
switch.  What  is  required  is  that  the  source  and  destination  for  each  word 
to  be  switched  must  be  selected  from  the  instruction  streams  and  combined 
to make a vector switch queue entry. Physical addresses will always be used.
In the case of memory instructions, we must reserve space in the vector
memory buffer. Finding a space in this buffer is simply a matter of
allocating a free location. Thus, simplified versions of the logic described in
Section  4.6.3.2.8  can  be  used.  The  data  rates  must  be  adequate  to  handle 
the  maximum  rate  at  which  instructions  can  emerge.  Table  37  summarizes 
the  logic  requirements  for  this  function. 

TABLE 36   LOGIC FOR INITIATING INSTRUCTION TRANSFERS

Unit                                     Size   Number   Gate Count

Vector and Scalar Instruction Buffers    6x4    2           3840
Memory Instruction Buffer                8x4    1           2560
Vector and Scalar Buffer Controls               2            400

TOTAL                                                       6800


TABLE 37   LOGIC FOR GENERATING VECTOR SWITCH INSTRUCTIONS

Unit                                         Size   Number   Gate Count

Select up to 2 out of 32 Available
  Memory Source Buffer Locations                                 400
Select up to 2 out of 32 Available
  Memory Destination Buffer Locations                            300
Generate Vector Switch Queue Entries
  for Memory Instructions                                        800
Generate Vector Switch Queue Entries
  for Vector Instructions                                        400

TOTAL                                                           1900
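
The "select up to 2 out of 32" function of Table 37 can be pictured in software as picking the lowest set bits of a 32-bit free mask. The following is a minimal sketch under that assumption. The mask representation and the names select_free_slots and allocate are hypothetical, and the hardware would make this selection combinatorially rather than in a loop.

```python
from typing import List, Tuple


def select_free_slots(free_mask: int, want: int = 2, size: int = 32) -> List[int]:
    """Return up to `want` indices of set (free) bits in `free_mask`, lowest first."""
    chosen: List[int] = []
    mask = free_mask & ((1 << size) - 1)   # ignore bits beyond the buffer size
    while mask and len(chosen) < want:
        lowest = mask & -mask              # isolate the lowest set bit
        chosen.append(lowest.bit_length() - 1)
        mask ^= lowest                     # clear it and look for the next one
    return chosen


def allocate(free_mask: int, want: int = 2) -> Tuple[int, List[int]]:
    """Reserve up to `want` free locations; return the updated mask and the chosen slots."""
    slots = select_free_slots(free_mask, want)
    for slot in slots:
        free_mask &= ~(1 << slot)          # mark the location as in use
    return free_mask, slots
```

For example, allocate(0b101100) reserves locations 2 and 3 and leaves only location 5 free.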

4.6.5   Scalar Instruction Dispatcher Subsystem

The  SIDS  must  provide  time  indexes  and  maintain  use  counts  for  physical 
scalar  addresses.  The  logical  functioning  of  this  unit  is  described  in 
detail in Sections 4.3.2.1, 4.6.2.3, 4.6.2.5, and 4.6.3.2.1. We will
summarize these descriptions, provide an overall design of the unit and give
gate  count  estimates.  It  should  be  noted  here  that  the  SPU,  SST,  US,  and 
SSE  functions  listed  in  Section  4.6.3.1.5  are  performed  in  this  unit. 

4.6.5.1  SIDS  Functional  Summary 

The  functions  listed  in  Table  24  are  pipelined  with  an  emergence  rate 
of  one  instruction  per  minor  clock.  They  operate  on  the  scalar  status  table 
and  the  scalar  use  table.  The  scalar  status  table  contains  one  location  for 
each  possible  active  time  index.  It  allows  an  associative  search  to  be  made 
for  the  correct  time  index  of  a  scalar  operand.  Each  new  result  causes  the 
corresponding  time  index  location  to  be  loaded  with  the  physical  address  for 
that  result.  Simultaneously,  an  associative  search  is  made  to  delete  any 
entry  with  the  same  physical  address.  The  scalar  use  table  consists  of  two 
parts.  There  is  a  section  addressable  by  time  indexes  and  another  section 
associatively  addressable  by  physical  address.  This  second  portion  is  for 
scalars  in  use  with  a  time  index  that  is  about  to  be  or  has  been  reused. 
These  are  referred  to  as  the  index  scalar  table  and  old  operand  table. 
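
The behavior of the scalar status table described above can be sketched in software as follows. This is only an illustrative model: the class and method names are ours, the loops stand in for comparisons that the hardware performs on all entries in parallel, and the table size of 256 is taken from Table 39.

```python
from typing import List, Optional


class ScalarStatusTable:
    """One location per active time index; each location holds a physical address."""

    def __init__(self, size: int = 256) -> None:
        self.slots: List[Optional[int]] = [None] * size

    def store_result(self, time_index: int, phys_addr: int) -> None:
        # Associative search: delete any entry with the same physical address.
        for i, addr in enumerate(self.slots):
            if addr == phys_addr:
                self.slots[i] = None
        # Load the location for the new time index with the physical address.
        self.slots[time_index] = phys_addr

    def find_time_index(self, phys_addr: int) -> Optional[int]:
        # Associative search for the time index of a scalar operand.
        for i, addr in enumerate(self.slots):
            if addr == phys_addr:
                return i
        return None
```

Because each store removes the older entry for the same physical address, a search can match at most one location.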

4.6.5.2  Detailed  Design  of  SIDS 

We now provide a detailed functional description of the SIDS structure. Table 39
provides  a  description  and  gate  count  for  all  the  tables  we  refer  to.  The 
function  US  consists  of  two  parallel  stores  to  the  scalar  status  table. 
The  function  SPU  merely  retains  a  result  associated  with  its  time  index  to 
be searched in performing the SST function. The SSE function selects the
scalar  execution  unit.  Since  one  scalar  queue  can  drive  several  equivalent 
SEUs,  this  function  will  ordinarily  be  null  with  instructions  being  routed 
to  the  unique  queue  required.  If  it  is  desired  to  have  independent  queues, 
then  the  logic  of  Section  4.6.3.2.6  can  be  used.  The  SST  function  consists 
of  a  parallel  search  of  scalar  use  tables  for  the  most  recent  reference  to 
the  specified  physical  addresses.  Flow  charts  for  the  functions  UU,  USU, 
and  RU  are  provided  in  Figure  28.  The  AL  function  consists  of  accumulating 
a  list  of  result  locations  and  time  indexes  as  use  counts  with  non-zero  links 
go  to  zero.  These  control  functions  can  all  be  implemented  in  under  10,000 
gates. 


TABLE 38   SIDS FUNCTIONS

Function                                   Abbreviation   Time   Dependency

Use Result to Update Scalar Status         US             2      None
  Table
Retain Result for Pipelined Search         SPU            1      None
  of Scalar Status Table that will
  happen before this Entry is Complete
Select Scalar Execution Unit               SSE            1      None
Find Time Indexes for Operands             SST            2      SPU
Update Use Counts for Operands             UU             2      SST
Update Scalar Use Table as Instructions    USU            2      Instruction
  are Executed                                                   Execution
Accumulate List of Time Indexes and        AL             8      Continuous
  Physical Locations Pairs for Stores                            Function
  that can Proceed
Use Result to Update Scalar Use Table      RU                    None

TABLE 39   SPECIFICATIONS AND GATE COUNTS FOR SIDS TABLES

I    SCALAR STATUS TABLE

Size                256 entries
Fields              12 bits for physical scalar address
Parallel Accesses   2 associative reads (SST)
                    1 store (US)
Gate Count          256(12*8*2 + 4*12) = 61440

II   INDEX USE TABLE

Size                256 entries
Fields              12 bits for physical address (associatively addressable)
                    2 bits for top and bottom list flag (associatively
                      addressable)
                    8 bits for link
                    6 bits for use count
Parallel Accesses   2 increments of use count (UU)
                    2 decrements of use count (USU)
                    1 associative read (RU)
                    1 store (RU)
                    1 store (UU)
Gate Count          256(14*8 + 8*8 + 6*16 + 4*28) = 98,304

III  OLD OPERAND TABLE

Size                64
Fields              12 bits for physical address (associatively addressable)
                    1 bit for top of list (associatively addressable)
                    8 bits for link to index use table
                    6 bits for use counter
Parallel Accesses   1 associative search (UR)
                    1 set for new result (UR)
                    2 associative searches and increment counter (UU)
                    2 associative searches and decrement counter (USU)
                    1 read of link when use count goes to zero (AL)
                    1 store (UR)
Gate Count          64(13*8*5 + 6*16 + 12*4 + 4*27) = 49,408

Assumptions used in gate count:

8 gates per bit per associative access
4 gates per bit per regular access
16 gates per counter bit
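
The three gate count expressions of Table 39 can be checked mechanically. The short sketch below simply evaluates them as printed and sums the results; the dictionary and variable names are illustrative only.

```python
# Gate count expressions copied from Table 39, evaluated as printed.
sids_tables = {
    "scalar status table": 256 * (12 * 8 * 2 + 4 * 12),
    "index use table": 256 * (14 * 8 + 8 * 8 + 6 * 16 + 4 * 28),
    "old operand table": 64 * (13 * 8 * 5 + 6 * 16 + 12 * 4 + 4 * 27),
}

sids_total = sum(sids_tables.values())

if __name__ == "__main__":
    for name, gates in sids_tables.items():
        print(f"{name}: {gates}")
    print("total:", sids_total)
```

The total of 209,152 gates is the figure carried forward for the SIDS tables in the summary of Section 4.7.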

FIGURE 28   SIDS FLOWCHARTS (cont.)



4.6.6   Vector  Instruction  Dispatcher  Subsystem 

The  VIDS  has  the  responsibility  of  freeing  physical  vector  addresses 
as  soon  as  possible.  The  algorithm  for  doing  this  is  to  maintain  a  use 
count  for  each  logical  buffer  address.  If  a  store  is  processed  going  to  a 
logical  address,  the  corresponding  physical  location  can  be  reused  when  the 
use  count  goes  to  zero.  Use  counts  are  incremented  each  time  an  operand 
appears  in  the  instruction  stream  in  the  VIDS.  They  are  decremented  each 
time  the  Vector  Switch  or  internal  switch  transfers  an  operand.  Since  the 
physical  address  of  each  active  vector  buffer  location  is  unique,  we  may 
organize  the  table  on  this  basis.  Table  40  gives  the  specifications  of  this 
table. Fewer than 4000 gates will be required for control purposes.

TABLE 40   VIDS TABLE SPECIFICATIONS

Size                256
Fields              8 bits for logical address
                    6 bits for use count
                    1 bit indicating location may be freed when use count
                      is zero
Parallel Accesses   1 store for new result from instruction stream
                    1 associative search on a result from instruction stream
                    2 increments of use count for each operand in
                      instruction stream
                    2 decrements of use count for each operand used by VEUs
                    1 decoding of physical address when it becomes available
Gate Count          256(8*8 + 4*15 + 6*16) + 512 = 56,832

See Table 39 for gate count assumptions.
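
The use count algorithm of the VIDS can be modeled in a few lines. The sketch below is an assumption-laden illustration rather than the hardware design: the table is keyed by physical address as suggested above, the names VectorUseTable, new_result, operand_seen and operand_transferred are hypothetical, and the associative search on logical address is written as a loop.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entry:
    logical: int                  # logical buffer address held at this location
    use_count: int = 0            # operands issued but not yet transferred
    free_when_idle: bool = False  # set once a later store targets the same logical address


class VectorUseTable:
    def __init__(self) -> None:
        self.entries: Dict[int, Entry] = {}  # keyed by physical address
        self.freed: List[int] = []           # physical locations available for reuse

    def new_result(self, logical: int, physical: int) -> None:
        # A store to `logical` makes any older location for it reclaimable.
        for phys, entry in list(self.entries.items()):
            if entry.logical == logical:
                entry.free_when_idle = True
                self._maybe_free(phys)
        self.entries[physical] = Entry(logical)

    def operand_seen(self, physical: int) -> None:
        # An operand appeared in the instruction stream.
        self.entries[physical].use_count += 1

    def operand_transferred(self, physical: int) -> None:
        # The Vector Switch or internal switch delivered the operand.
        self.entries[physical].use_count -= 1
        self._maybe_free(physical)

    def _maybe_free(self, physical: int) -> None:
        entry = self.entries[physical]
        if entry.free_when_idle and entry.use_count == 0:
            del self.entries[physical]
            self.freed.append(physical)
```

A location is released only when both conditions hold: a newer store has superseded it and every outstanding operand reference has been transferred.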

4.7   GATE COUNT SUMMARY

Table  41  provides  a  summary  gate  count  for  all  the  logic  discussed 
in  this  chapter.  It  is  divided  into  buffer  type  gates  and  other  logic. 
Summaries  are  provided  for  the  computation  portion  of  the  IUD  and  memory 
control.  The  counts  do  not  include  memory  itself,  which  consists  of  one 
million  words  of  64  bits  with  a  1  major  clock  access  rate.  The  gate  counts 
for buffers assume 4 gates per bit. We have assumed 5000 gates per parallel
computing element in each VEU and 10,000 gates per SEU.

TABLE 41   COMPUTATION UNIT SUMMARY GATE COUNT

Unit                              Gate Type   Source of Count         Count

6 SEUs                            1           Estimate               60 000
Scalar Status Tables              1           Table 5                16 050
Scalar Buffers                    2           Table 6               655 360
Scalar Switch                     1           Table 8                88 880
Scalar Assembling Unit            1           Table 9                25 000
6 VEUs (control)                  1           Table 11               87 936
6 VEUs (buffers)                  3           Table 11              480 000
6 VEUs (arithmetic)               1           Estimate              240 000
Vector Buffer                     3           Section 4.4.3         131 072
Vector Switch                     1           Table 13              252 544
Memory Switches and Control       1           Table 14              992 040

Remaining components are in IUD.

12 Assign Instruction No.         1           Table 21                1 400
IUD Front End                     1           Table 23               17 104
Time Index Generator              1           Table 27                2 042
Vector Buffer Comparison Tree     1           Table 28               23 824
Vector Status Table               1           Table 29               76 696
Ports to VEU Queue Selector       1           Table 31                  400
Partial Instruction Detection     1           Table 32                  192
VEU Queue Selection               1           Table 34               21 007
Reserve Vector Buffer Storage     1           Section 4.6.3.2.8      14 500
Update Vector Table, etc.         1           Section 4.6.3.2.9       6 500
Assembling Instructions           1           Table 35               47 595
Initiating Instruction Transfer   1           Table 36                6 800
Vector Switch Instructions        1           Table 37                1 900
SIDS Control                      1           Section 4.6.5.2        10 000
SIDS Tables                       1           Table 39              209 152
VIDS                              1           Table 40               56 382

Gate Types:  1.  Ordinary logic.
             2.  Simple memory access rate 1 word per 2 minor clocks.
             3.  Simple memory access rate 1 word per minor clock.

Scalar and Computation Summary: Type 1, 770,420; Type 2, 655,360;
Type 3, 1,603,112.

Memory Summary: Type 1, 1,851,552; Type 3, 992,040.

IUD Summary: Type 1, 495,494.
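
The IUD summary figure can be reproduced by adding the IUD component counts as they are listed in Table 41. The sketch below does only that; the dictionary keys are abbreviated unit names and the variable names are ours.

```python
# IUD component gate counts, copied as listed in Table 41.
iud_components = {
    "Assign Instruction No.": 1_400,
    "IUD Front End": 17_104,
    "Time Index Generator": 2_042,
    "Vector Buffer Comparison Tree": 23_824,
    "Vector Status Table": 76_696,
    "Ports to VEU Queue Selector": 400,
    "Partial Instruction Detection": 192,
    "VEU Queue Selection": 21_007,
    "Reserve Vector Buffer Storage": 14_500,
    "Update Vector Table, etc.": 6_500,
    "Assembling Instructions": 47_595,
    "Initiating Instruction Transfer": 6_800,
    "Vector Switch Instructions": 1_900,
    "SIDS Control": 10_000,
    "SIDS Tables": 209_152,
    "VIDS": 56_382,
}

iud_total = sum(iud_components.values())

if __name__ == "__main__":
    print("IUD Type 1 gates:", iud_total)
```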

5   MACRO INSTRUCTION DECODER, I/O CONTROL AND EXTERNAL EXPANDABILITY

The machine designed in Chapter 4 with the addition of some I/O control
could be a complete CPU. In this chapter we briefly discuss possible
additions to it that could significantly enhance its performance. The
Macro Instruction Decoder, as described in Chapter 2, converts UAL
instructions into Operand Fixed Format Instructions. Its primary purpose is to
provide a high level of flexibility and to help eliminate program
non-determinism as discussed in Section 3.2. These are also the reasons for
including a scheme for anticipatory I/O. We will also describe the paging
algorithms for this machine. Finally, we will discuss external
expandability, or the connection of many of these computers to form a single
working unit. This chapter is an outline of projects we would undertake if we
had unlimited time, energy and resources.

5.1   MACRO INSTRUCTION DECODER

The Macro Instruction Decoder may be regarded as a combination
interpreter of UAL and operating system. Its primary function is to convert
UAL  instructions  to  OFFL  instructions.  Involved  in  this  process  are  the 
following  major  tasks: 

1.  Convert  instructions  operating  on  arbitrary  sized  vectors  to 
operate  on  the  fixed  vector  width  of  the  machine. 

2.  Ensure that all memory accesses refer to pages present in
Primary  Memory. 

3.  Execute  all  transfers  of  control.  In  the  case  of  conditional 
transfers,  attempt  to  cause  any  required  values  to  be  computed 
by the execution units at the earliest feasible time. (The MIDs
can request values from the EUs for use in evaluating
conditional transfers.)

4.  Attempt  to  anticipate  I/O  requests  at  the  earliest  possible 
time. 

5.  Perform  normal  operating  system  functions. 
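
Task 1 in the list above amounts to cutting an arbitrary-length vector operation into pieces no wider than the machine. The sketch below illustrates the idea only; it is not the algorithm of [1]. The width of 8 is borrowed from the 8-word parallel units mentioned in Chapter 6, and the names strip_mine and vector_add are hypothetical.

```python
from typing import Iterator, List, Tuple


def strip_mine(length: int, width: int = 8) -> Iterator[Tuple[int, int]]:
    """Yield (start, count) pieces covering 0..length-1, each at most `width` wide."""
    for start in range(0, length, width):
        yield start, min(width, length - start)


def vector_add(a: List[int], b: List[int], width: int = 8) -> List[int]:
    """Add two equal-length vectors one fixed-width piece at a time."""
    out = [0] * len(a)
    for start, count in strip_mine(len(a), width):
        # Each pass models one fixed-width instruction; only `count` lanes are active.
        for lane in range(count):
            out[start + lane] = a[start + lane] + b[start + lane]
    return out
```

A vector of 19 elements, for instance, becomes two full-width pieces followed by one piece with three active lanes.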

We  have  described  in  detail  algorithms  for  converting  vector  instructions 
operating  on  arbitrary  sized  vectors  to  instructions  for  a  fixed  vector 
width  [1].  We  will  not  discuss  this  function  further  here.  The  second 
MID  function  may  require  subscript  evaluation.  Whenever  this  occurs,  the 
effect is the same as a conditional branch. The MID cannot continue
processing instructions until it can be assured that the required pages are
available. Two features should be included to minimize problems associated
with this situation. First, both the compiler and the MID should attempt
to ensure that subscript expressions are evaluated as early as is practical.
The  MID  should  be  constructed  to  use  this  information.  Second,  it  should 
be  possible  to  declare  various  arrays  as  save  core  during  execution  of 
various  program  segments.  If  it  becomes  necessary  to  swap  a  page  of  save 
core,  then  the  entire  associated  program  should  be  swapped  out.  The 
programmer,  the  compiler,  and  possibly  even  the  MID  should  have  the  option 
of requesting save core. The last three functions are all standard ones,
subject to the observations we have already made about minimizing the
effects of non-determinism.

The  techniques  used  in  Chapter  4  should  allow  one  to  implement  the 
functions  described  in  hardware  in  an  efficient  manner.  A  great  deal  of 
analysis  and  experimentation  would  be  required  to  obtain  a  good  final 
result.  In  Chapter  6  we  will  outline  some  pragmatic  considerations  about 
constructing  the  entire  system.  We  will  conclude  these  remarks  on  the 
MID  by  considering  one  major  issue  that  should  significantly  influence  its 
detailed  design. 

The MID performs many compiler-like functions, and it is an open
question as to which functions should be performed by the compiler and which
by the MID. The primary motivation for moving compiler functions to the
MID is the existence of information at execution time that is not available
at compile time. The primary drawback is that the functions must be
performed in an interpretive manner whenever a given code segment is executed.
To the degree that it is possible to do the analysis fast enough with logic
that is significantly less costly than the "computing portion" of the
machine, this is not a major drawback. The only way to get a good hold on
what the tradeoffs are is to do some experimentation. We do not yet have
all the techniques required to design a good MID as outlined above. Once we
have generated some basic set of building block ICs and have experience
with connecting them, similar to the experience we now have in constructing
large compilers, I would anticipate this approach to be highly
productive.

5.2   PAGING DESCRIPTION

In  this  section  we  outline  some  minimum  requirements  for  a  paging 
algorithm  to  function  with  the  machine  already  described  and  describe  some 
information  that  the  MID  could  make  available  and  that  would  be  of  use  to 
an intelligent memory manager. One essential requirement is that a page
be locked if any instruction accessing it has gotten past the MID. A
locked page cannot be transferred to backup memory until all pending
requests  for  access  from  the  EUs  have  completed.  Another  required  page 
status  is  that  it  be  saved.  A  saved  page  is  one  considered  essential  to 
the  current  reasonable  execution  of  a  particular  program  and  cannot  be 
swapped  unless  the  entire  program  is  swapped.  Additionally,  the  MID  can 
look  ahead  and  anticipate  what  pages  are  about  to  be  accessed.  Thus,  an 
additional  state  a  page  can  be  in  is  that  of  about  to  be  required.  It 
should  be  possible  for  the  MID  to  provide  a  rough  estimate  of  how  imminent 
the  access  is  as  a  basis  for  determining  priorities. 
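
The page states just described can be summarized in a small model. The sketch below is one possible reading, not a specification: the field names, the encoding of imminence as a number where smaller means sooner, and the ordering policy in eviction_candidates are all our assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Page:
    pending_accesses: int = 0        # accesses past the MID that the EUs have not completed
    saved: bool = False              # essential to the program; swapped only with the program
    imminence: Optional[int] = None  # rough MID estimate (smaller = sooner); None = not anticipated

    @property
    def locked(self) -> bool:
        return self.pending_accesses > 0


def eviction_candidates(pages: Dict[int, Page]) -> List[int]:
    """Pages that may be moved to backup memory, least urgently needed first."""
    free = [n for n, p in pages.items() if not p.locked and not p.saved]
    # Unanticipated pages come first, then anticipated pages with the most distant access.
    return sorted(
        free,
        key=lambda n: (pages[n].imminence is not None, -(pages[n].imminence or 0)),
    )
```

Locked and saved pages never appear in the candidate list, and the imminence estimate only orders the pages that remain.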

5.3   EXTERNAL EXPANDABILITY

The  remarks  in  this  section  will  be  primarily  philosophical.  They 
might  be  thought  of  as  an  expansion  on  the  ideas  that  led  to  the  two- 
level  clock  discussed  in  Section  3.1.3.1.  The  fundamental  concept  is  that 
the  physical  size  of  a  computing  structure  imposes  constraints  on  the 
interface  and  control  structures  of  subunits.  The  primary  constraint  is 
that  the  larger  the  physical  size,  the  longer  the  delays  that  must  be 
tolerated.  A  secondary  constraint  is  that  the  amount  of  information  passed 
between  subunits  should  be  kept  reasonably  small.  The  interface  scheme 
for  two  clocks  at  different  structural  levels  could  be  generalized  to  more 
levels. An especially serious constraint that is related to the
discussions on non-determinism in the previous section is that of the control
structure. Traditional computers have a hierarchical control structure.
The operating system resident in the CPU controls the entire computing
system. Some units like I/O channels may have a limited degree of
autonomy. Computer networks like the ARPA net have a democratic structure.
There  is  no  central  source  of  control.  The  larger  the  physical  size  of 
a  computing  unit,  the  more  desirable  a  democratic  structure  becomes.  If 
one  wished  to  use  a  large  computing  system  for  a  single  problem  and  it 
possessed  a  democratic  structure,  then  one's  program  would  need  to  reflect 
that structure. Basically, what is required are fork and join operations
and the ability for independent processes to interrupt and in other ways
communicate with each other. There do exist computing languages with
these features. They are primarily used in real-time computing systems.
Our computer structure, with its operating system computer, the MID,
independent of the number crunching part of the machine, could provide an
excellent candidate for a democratically structured computing system.

6   CONCLUSION

To  perform  a  detailed  and  complete  analysis  of  the  structure  we  have 
designed would require an extremely elaborate and costly computer
simulation. Such a process could provide much information about ironing out
details, refining design parameters, and in general improving
implementation details. Such a process is not necessary to provide general estimates
of  the  performance  of  this  structure.  For  this  purpose  we  can  use  the 
generalized  measures  on  FORTRAN  programs  which  have  been  experimentally 
obtained.  We  justify  this  approach  as  being  a  useful  and  meaningful  first 
iteration  in  the  process  of  developing  the  design  techniques  and  structural 
approach  we  have  adopted. 

The  basic  postulate  is  that  this  structure  can  obtain  the  potential 
speed-up  and  efficiency  that  has  been  measured  in  FORTRAN  programs.  We 
justify  this  statement  by  the  flow  analysis  that  we  have  provided  throughout 
Chapter  4  and  by  the  structure  of  the  arithmetic  units.  The  effective 
width  of  our  machine  is  38.  This  includes  four  8-word  wide  parallel  units 
and  six  scalar  arithmetic  units.  Although  some  of  the  FORTRAN  programs 
could  benefit  from  a  wider  machine,  most  used  roughly  this  amount  or  less 
parallelism.  The  multiprogramming  structure  of  the  machine  allows  the 
entire  machine  to  be  fully  utilized  while  executing  individual  programs 
that could not effectively utilize it. Our hardware based real-time
scheduling permits less compile time analysis and tolerates non-deterministic
breaks, provided they are sparse enough to occur without degrading utilization.

We  did  not  start  out  to  design  a  machine  to  accommodate  arbitrary 
FORTRAN  programs  of  the  type  measured  and  would  not  propose  that  the  machine 
be  devoted  to  an  essentially  random  mix  of  FORTRAN  programs.  The  most 
cost effective way to execute a small FORTRAN program is to find the
smallest  available  minicomputer  that  will  accommodate  it.  Showing  that 
such  a  structure  can  be  effectively  utilized  on  such  a  random  mix  of  jobs 
guarantees  that  it  works  for  the  worst  cases  that  it  is  likely  to  encounter. 
Our  original  aim  was  to  design  a  good,  flexible,  easy-to-program  parallel 
computer  based  to  a  large  degree  on  our  experience  and  intuition  obtained 
from  working  with  ILLIAC  IV  and  thinking  about  other  parallel  computers. 
Thus,  for  example,  our  independent  vector  and  scalar  execution  units 
evolved  directly  from  problems  in  programming  ILLIAC.  Just  as  the  FORTRAN 
program  measurements  were  intended  as  a  sort  of  benchmark  establishing  a 
minimum  degree  of  utilizable  parallelism  over  a  broad  class  of  problems, 
we  use  them  here  as  a  minimum  benchmark  of  this  machine's  performance. 

One  objective  of  our  work  is  simply  not  measurable.  That  is  to 
provide  a  machine  that  is  easy  to  program.  It  is  my  belief  that  one  of  the 
major  difficulties  in  using  current  parallel  machines  effectively  is  that 
few  people  understand  how  to  program  them.  Our  primary  inspiration  for 
this  process  was  the  B5500  machines  and  their  use  of  hardware  to  handle 
many  of  the  tedious  details  of  programming  and  to  do  so  in  an  execution 
time  dynamic  way.  The  ultimate  measurement  of  the  value  of  that  approach 
as  it  was  applied  in  those  machines  was  the  economic  success  of  a  machine 
that,  if  it  were  rated  on  a  multiplies  per  dollar  basis,  would  come  out 
very poorly. Because the problems of programming parallel machines are
significantly  more  complex,  such  hardware  aids  seem  to  us  to  be  even  more 
desirable  for  them. 

In  summary,  the  design  can  effectively  exploit  the  parallelism  that  has 
been  measured  in  a  broad  class  of  problems.  It  has  a  great  many  features 
that  should  significantly  ease  the  burden  of  exploiting  parallelism. 



SPECIFIC  RESULTS 

The result of this work is not a detailed plan for constructing a
computer, but rather the development of a general approach and techniques
for implementing that approach. The detailed design work and its
relationship to measures on FORTRAN programs are intended as a justification that
the approach and techniques are practical and effective. Here we will
sort out which of our techniques and approaches appear particularly
successful and which areas call for additional study.

The generalization of the technique for designing a carry look-ahead
adder seems to be a useful technique for designing fast, complex
combinatorial circuits. This is the technique described in Section 4.6.3.2.7.
One area where this technique might be productively employed is in
providing real-time dynamic control for a multi-level crossbar switch as defined
in [4]. This unit allows the arbitrary permutation of a vector in an
extremely cost effective way, but it requires a highly complex scheduling
algorithm. If one could construct a relatively inexpensive combinatorial
circuit to schedule such a network, one would probably have the ideal
crossbar switch for large applications. This scheduling problem may well
be suited to the type of analysis we developed. One could begin by
designing the logic for a two-level 4x4 crossbar, then gradually move up to higher
levels. The analysis technique would certainly provide a reasonable
hardware scheduling algorithm for the initial small switches and the intuition
developed might well lead to generalizations valid for larger switches.
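
For reference, the carry look-ahead idea that is being generalized can be sketched in software. The code below is the textbook generate/propagate formulation with a logarithmic-depth prefix combination; it is not the circuit of Section 4.6.3.2.7, and all names in it are illustrative.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (generate, propagate) signals for a block of bit positions


def combine(high: GP, low: GP) -> GP:
    """Merge a more significant block with the adjacent less significant block."""
    g_hi, p_hi = high
    g_lo, p_lo = low
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def lookahead_carries(a: int, b: int, width: int) -> List[int]:
    """Return the carry into bit positions 0..width, assuming a carry-in of 0."""
    gp = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(width)]
    span = 1
    while span < width:
        # Each round doubles the number of positions summarized by gp[i].
        gp = [combine(gp[i], gp[i - span]) if i >= span else gp[i] for i in range(width)]
        span *= 2
    return [0] + [g for g, _ in gp]


def add(a: int, b: int, width: int) -> int:
    """Add two width-bit numbers using the look-ahead carries."""
    carries = lookahead_carries(a, b, width)
    total = 0
    for i in range(width):
        total |= (((a >> i) ^ (b >> i) ^ carries[i]) & 1) << i
    return total | (carries[width] << width)
```

The point of the formulation is that combine is associative, so the carries for all positions can be produced in a number of rounds that grows only logarithmically with the width.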

Most  of  the  logic  design  we  have  done  is  certainly  far  from  optimal. 
Our circuitry for very fast conflict resolution may be an exception. It
requires few gates, is extremely fast, and we have proven that it can be
generalized to an arbitrary number of units in possible conflict. We
have  used  it  in  a  great  many  situations  in  our  overall  design.  This  unit 
is  described  in  Section  4.4.2.4. 

The  observations  about  block  structure  and  universal  building  blocks 
in Section 3.1 seem to us to be particularly significant and an area
requiring much further development. Certainly designing a set of basic
building block ICs is a problem of major significance for the
supercomputers of the future. We have made some very preliminary steps in that
direction.

The concept of instruction level multiprogramming seems to be useful
in the environment in which we have employed it. The advent of cheap mini and
microcomputers has certainly greatly reduced the need for multiprogramming.
It does seem to us to be an important feature for very large parallel
computers for two reasons. First, a great many runs on such computers will
be short debugging runs. The availability of the machine for such
purposes can be a very critical factor in program development time.
Multiprogramming can allow short high-priority jobs to be run while longer
production jobs are also using the machine. The second reason is that
providing two independent processes may be an effective way to program some large
tasks. Providing hardware to execute these is desirable. Instruction
level  multiprogramming  is  particularly  nice  in  that  there  is  no  overhead 
involved  in  swapping  out  programs  from  registers.  The  operating  system 
control  resides  in  a  processor  entirely  independent  from  the  various 
arithmetic  units,  and  as  long  as  any  MID  is  feeding  them  instructions, 
they  can  be  utilized  efficiently. 




LIST  OF  REFERENCES 


1  Budnik, P. P., "Tranquil Arithmetic," M.S. Thesis, University of
Illinois, 1969.

2  Budnik, P. P., "An Intuitive Interpretation of the Hyperarithmetic
Sets," talk presented at the Spring 1972 meeting of the Association
for Symbolic Logic, abstract printed in the Journal of Symbolic
Logic, volume 37, number 4, p. 778, December 1972.

3  Davis, E. W., "A Multiprocessor for Simulation Applications,"
Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department
of Computer Science Report No. 527, June 1972.

4  Kuck, D. J., D. H. Lawrie, and Y. Muraoka, "Interconnection Networks
for Processors and Memories in Large Systems," COMPCON 72 Digest of
Papers, pp. 131-134.

5  Kuck, D. J., Y. Muraoka, and S. C. Chen, "On the Number of Operations
Simultaneously Executable in FORTRAN-Like Programs and Their
Resulting Speedups," IEEE Trans. on Computers, volume 21, number 12,
December 1972, pp. 1293-1310.

6  Muraoka, Y., "Parallelism Exposure and Exploitation in Programs,"
Ph.D. Thesis, University of Illinois at Urbana-Champaign, Department
of Computer Science Report No. 424, 1971.

7  Rogers, H., Theory of Recursive Functions and Effective Computability,
McGraw Hill, 1967.

8  Tomasulo, R. M., "An Efficient Algorithm for Exploiting Multiple
Arithmetic Units," IBM Journal of Res. and Devel., volume 11, number 1,
January 1967.

9  Turn, Rein, Computers in the 1980s, Columbia University Press, 1974.




APPENDIX  A 
DETAILED  LOGIC  FOR  VECTOR  EXECUTION  UNIT  SELECTOR 

This  appendix  describes  in  detail  the  unit  outlined  in  Section 
4.6.3.2.7.  Some  notational  conventions  and  structure  of  this  appendix 
are  explained  in  Section  4.6.3.2.7. 

The  following  conventions  will  be  observed  throughout  this  appendix: 

1.  Superscript  i  ranges  over  (0,1,2,3)  and  refers  to  an  instruction.

2.  Superscript  n  ranges  over  (0,1,2,3)  and  refers  to  a  VEU. 

3.  Superscripts  or  subscripts  may  be  omitted  when  they  are  uniform 
and  unambiguous  throughout  an  equation. 

4.  All  weight  registers  are  5  bits  wide. 

5.  Value ($X_j$) indicates the value of the binary integer defined
by Boolean values $X_0, X_1, \ldots, X_4$. $X_0$ has highest significance.
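
As an illustration of convention 5, the following Python sketch computes the value of a bit vector whose element 0 is the most significant bit. The function name `value` mirrors the notation of this appendix; representing a Boolean vector as a Python list is an assumption made only for this example.

```python
def value(bits):
    """Integer value of a bit list in which bits[0] is the most significant bit."""
    result = 0
    for bit in bits:
        # Shift the bits seen so far up by one place and append the next bit.
        result = (result << 1) | (1 if bit else 0)
    return result
```

For a 5-bit weight register, `value([0, 0, 1, 0, 1])` is 5 and `value([1, 0, 0, 0, 0])` is 16.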

In  addition,  the  following  notation  will  be  used: 

$U_j^n$ = $j$th bit of amount added to the size of the queue for VEU $n$.
A-1   QUEUE  WEIGHT  REGISTERS  AND  ADDERS

Inputs

$U_j^n$

$d^n$ = indicates queue $n$ is to be decremented by 1.

Function

value($U_j^n$) - ($d^n$) must be added to contents of a queue weight
register, $a_j^{\ell n}$. $\ell$ = (0,1,...,5) and ranges over the 6 queue
weights for a single queue.

Algorithm 

We  will  use  a  serial  counter  and  a  serial  adder  cascaded  together. 
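
The cascade described above can be sketched behaviorally in Python. This is a minimal sketch under stated assumptions, not the gate-level design: the helper names (`to_bits`, `from_bits`, `ripple_decrement`, `ripple_add`, `update_weight`) are hypothetical, bits are stored least significant first for convenience (the appendix numbers bits differently), and overflow simply wraps.

```python
def to_bits(number, width):
    """Least-significant-bit-first list of `width` bits (ordering is an assumption)."""
    return [(number >> k) & 1 for k in range(width)]

def from_bits(bits):
    """Inverse of to_bits."""
    return sum(bit << k for k, bit in enumerate(bits))

def ripple_decrement(a_bits, d):
    """Counter stage: subtract the one-bit value d, rippling a borrow upward."""
    out, borrow = [], d
    for a in a_bits:
        out.append(a ^ borrow)        # difference bit
        borrow = (1 - a) & borrow     # a borrow passes only through zero bits
    return out

def ripple_add(t_bits, u_bits):
    """Adder stage: add u_bits to the counter output with a rippling carry."""
    out, carry = [], 0
    for t, u in zip(t_bits, u_bits):
        out.append(t ^ u ^ carry)
        carry = (t & u) | (t & carry) | (u & carry)
    return out

def update_weight(a_bits, u_bits, d):
    """Cascade of the two stages: new weight = a - d + U."""
    return ripple_add(ripple_decrement(a_bits, d), u_bits)
```

For example, a 5-bit weight of 9 with d = 1 and U = 3 becomes 11.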

Equations 

$t_j$      indicates $j$th bit of counter output

$ct_j$     indicates carry from $j$th place of counter

$ar_j$     new value of $a_j^{\ell n}$

$c_j$      indicates carry from $j$th bit in producing output

*o        =    ao  d  v  ao  d 
co        =    aod 


*J  =     aj    Cj-1    Vaj   Cj-1  U"   ]'2'3) 

cj      =    aj  cj-i  U  =  ]'2'3) 


ar        =     t    a     v  t     a 

o  oooo 
o
ar. 


t    a 
o     o 

*J  aj  Cj-1  V  Fj  *J  Cj-1   V  Tj  aj  «j-l  v  *J  *j  'j-l     0  "  1-2,3) 

*j  8J  V  *J  CM  V  aj  CJ-1  0-1.2.3) 


Logic  Levels 
6 


Gates 


115  (These  are  just  standard  counter  and  adder  circuits.  We  include 
these  as  an  example  of  our  notation  and  because  we  wish  to  design  this 
unit  in  complete  detail.) 
A-2   WEIGHT  SELECTION  LOGIC
Inputs

$Y^{ik}$      indicates instruction $i$ operand $k$ is unknown ($k$ = 0,1).

$X_j^{ik}$    bit $j$ of queue address where operand $k$ of instruction $i$
resided ($j$ = 0,1,2,3). We will assume ($X_0^{ik} \lor X_1^{ik}$) implies
operand not assigned to one of these four VEUs.

$A_j^i$       bit $j$ of VEU assigned to instruction $i$.


Outputs 


$W_\ell^{in}$  indicates if weight $\ell$ for instruction $i$ queue $n$ is
to be switched. $\ell$ has the following meanings:

Number of Operands      Instruction i       Instruction i
from Instruction i      Assigned to         not Assigned
in Queue n              VEU n               to VEU n

        0               $W_0^{in}$          $W_3^{in}$

        1               $W_1^{in}$          $W_4^{in}$

        2               $W_2^{in}$          $W_5^{in}$

An unknown operand will count as being in the assigned VEU.
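
The table of meanings for the weight index can be sketched in Python as follows. The function and parameter names are hypothetical, and the treatment of an unknown operand follows the sentence above (it is counted only for the queue of the assigned VEU), which is one reading of the rule.

```python
def operands_in_queue(operand_queues, unknown, assigned_veu, n):
    """Count the operands of one instruction that are credited to queue n.

    operand_queues[k] is the queue holding operand k (ignored when unknown[k]).
    An unknown operand is credited to the queue of the assigned VEU.
    """
    count = 0
    for queue, is_unknown in zip(operand_queues, unknown):
        if is_unknown:
            count += int(assigned_veu == n)
        else:
            count += int(queue == n)
    return count

def weight_index(count, assigned):
    """Weight index from the table: 0..2 when assigned to VEU n, 3..5 otherwise."""
    if count not in (0, 1, 2):
        raise ValueError("an instruction has at most two operands")
    return count if assigned else count + 3
```

With one operand in queue 2, one unknown operand, and the instruction assigned to VEU 2, both operands are credited to queue 2, so weight index 2 is selected for that queue.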


Algorithm 


First  we  generate: 

$P_k^{in}$     which indicates if operand $k$ from instruction $i$ is in
queue $n$.

$AP_k^{in}$    same as $P_k^{in}$ but will include an unknown operand as
being assigned queue $n$.

$AA^{in}$      indicates if value ($A_j^i$) = $n$.

We will then use these to generate $W_\ell^{in}$ in a fairly obvious way.


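$P_0^{in} = [\mathrm{value}(X_j^{i0}) = n]$

$P_0^{i0} = \bar{X}_0^{i0}\,\bar{X}_1^{i0}\,\bar{X}_2^{i0}\,\bar{X}_3^{i0}$

$P_0^{i1} = \bar{X}_0^{i0}\,\bar{X}_1^{i0}\,\bar{X}_2^{i0}\,X_3^{i0}$

$P_0^{i2} = \bar{X}_0^{i0}\,\bar{X}_1^{i0}\,X_2^{i0}\,\bar{X}_3^{i0}$

$P_0^{i3} = \bar{X}_0^{i0}\,\bar{X}_1^{i0}\,X_2^{i0}\,X_3^{i0}$

$P_1^{in} = [\mathrm{value}(X_j^{i1}) = n]$

The expansion is similar to the above.

$AP_0^{in} = [\mathrm{value}(X_j^{i0}) = n] \lor Y^{i0}$

$AP_1^{in} = [\mathrm{value}(X_j^{i1}) = n] \lor Y^{i1}$

$AA^{in} = [\mathrm{value}(A_j^i) = n]$

$W_0^{in} = AA^{in}\,\overline{AP_0^{in}}\,\overline{AP_1^{in}}$

$W_1^{in} = AA^{in}\,AP_0^{in}\,\overline{AP_1^{in}} \lor AA^{in}\,\overline{AP_0^{in}}\,AP_1^{in}$

$W_2^{in} = AA^{in}\,AP_0^{in}\,AP_1^{in}$

$W_3^{in} = \overline{AA^{in}}\,\overline{AP_0^{in}}\,\overline{AP_1^{in}}$

$W_4^{in} = \overline{AA^{in}}\,AP_0^{in}\,\overline{AP_1^{in}} \lor \overline{AA^{in}}\,\overline{AP_0^{in}}\,AP_1^{in}$

$W_5^{in} = \overline{AA^{in}}\,AP_0^{in}\,AP_1^{in}$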

Logic  Levels 
2 

Gates 
114 
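
As a check on the intent of the weight selection equations, the sketch below evaluates the six switch signals for one instruction and queue and confirms that exactly one is true for every input combination. The placement of complements follows the table of meanings for the weight index (weights 0 to 2 when the instruction is assigned to VEU n, 3 to 5 otherwise) and is an assumption wherever the printed equations are unclear. The function names are hypothetical.

```python
from itertools import product

def weight_switches(aa, ap0, ap1):
    """Return [W0, ..., W5] for one instruction/queue pair.

    aa  : instruction is assigned to this VEU
    ap0 : operand 0 is credited to this queue
    ap1 : operand 1 is credited to this queue
    """
    na, n0, n1 = not aa, not ap0, not ap1
    return [
        aa and n0 and n1,                            # W0: assigned, 0 operands
        (aa and ap0 and n1) or (aa and n0 and ap1),  # W1: assigned, 1 operand
        aa and ap0 and ap1,                          # W2: assigned, 2 operands
        na and n0 and n1,                            # W3: not assigned, 0 operands
        (na and ap0 and n1) or (na and n0 and ap1),  # W4: not assigned, 1 operand
        na and ap0 and ap1,                          # W5: not assigned, 2 operands
    ]

def exactly_one_switch_everywhere():
    """True if exactly one W is set for each of the 8 input combinations."""
    return all(sum(weight_switches(*bits)) == 1
               for bits in product((False, True), repeat=3))
```

Because each of the six products covers a distinct combination of assignment and operand count, the switch selects one and only one of the six queue weights.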
A-3   INCREMENT  AND  MIN  SELECTOR  DETAILS

Inputs 

$a_j^{in}$    bit $j$ of queue weight from weight selection switch

$U_j^n$

Functions

1.  Subtract $U_j^n$ from each of 4 weights.

2.  Find  minimum  weight. 

3.  Subtract  minimum  weight  from  all  weights. 

Functions  1  and  2  are  combined  in  one  set  of  equations.  We  will 
discuss  these  functions  first. 
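
At the behavioral level, the three functions amount to the following Python sketch. It assumes that the queue adjustment is combined with each weight by addition, as the definition of the intermediate value b in the Algorithm suggests, and that ties for the minimum go to the lowest-numbered VEU. The function name is hypothetical.

```python
def increment_and_normalize(weights, increments):
    """Combine each weight with its queue adjustment, then subtract the minimum.

    weights[n]    : selected weight for VEU n
    increments[n] : amount U added for VEU n (addition is an assumption)
    Returns (normalized weights, index of the minimum).
    """
    bumped = [w + u for w, u in zip(weights, increments)]
    smallest = min(bumped)
    normalized = [b - smallest for b in bumped]
    return normalized, bumped.index(smallest)
```

After normalization the smallest weight is always 0, which is what allows the later stages to test for small constant values only.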

Algorithm 

We  will  break  up  the  operation  into  parts  by  computing  the  following 
intermediate  values: 

$b_j^{in}$      $j$th bit of value ($U_j^n$) + value ($a_j^{in}$)

$c_j^{in}$      carry from bit $j$ used in computing $b_j^{in}$.

$ASZ_k^{in}$    indicates that value ($a_0^{in}, a_1^{in}$) $\le k$.

$BSZ_k^{in}$    indicates that value ($b_0^{in}, b_1^{in}, b_2^{in}$) $\le k$.

$CSZ_k^{in}$    indicates that value ($b_3^{in}, b_4^{in}$) $\le k$.

$SB^{in}$       $b_3^{in} \lor b_4^{in}$

$M_j^{in}$      indicates that counting only bits 0 through $j$,
$b_j^{in}$ is a minimum over $n$ = (0,1,2,3).

$MCZ_\ell^{in}$ indicates that $M_2^{in} \land [\mathrm{value}(b_3^{in}, b_4^{in}) \le \ell]$.


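Time versus variable computed

Logic Level    Variables

    0          $U_j^n$, $a_j^{in}$

    1          $c_4^{in}$, $b_4^{in}$, $c_2^{in}$, $ASZ_k^{in}$, $b_2^{in}$

    2          $b_1^{in}$, $b_3^{in}$, $BSZ_k^{in}$, $c_1^{in}$, $SB^{in}$

    3          $b_0^{in}$, $CSZ_k^{in}$, $M_2^{in}$, $MCZ_\ell^{in}$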


Equations, Level = 1

$c_4 = U_4\,a_4$

$b_4 = U_4\,\bar{a}_4 \lor \bar{U}_4\,a_4$

$c_2 = U_2\,a_2 \lor a_2\,U_3\,a_3 \lor a_2\,a_3\,a_4\,U_4 \lor a_2\,U_3\,a_4\,U_4$

The above uses value ($U_j$) $\le$ 4.

$ASZ_0 = \bar{a}_0\,\bar{a}_1$

$ASZ_1 = \bar{a}_0$

$ASZ_2 = \bar{a}_0 \lor \bar{a}_1$

$ASZ_3$ = TRUE

b2        =     a2  U2IJUJ    v     a2U^U^U^    v     a£  U^  UJ  i^    v 

a2  °2  *3  ^    v     a2  ^2  *3  H     v     «i  U2     v 

*2  U3  a3    v  ^2  U3  U4  a4  v  ^2  a3  U4  a4 
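
The four-term level-1 equation for the carry out of bit 2 omits the terms that a general ripple adder would need. The Python sketch below checks exhaustively that the shortened form agrees with a full three-bit ripple carry whenever the value of U is at most 4. It assumes that bit 4 is the least significant bit and that U occupies bits 2 through 4; the function names are hypothetical.

```python
from itertools import product

def majority(x, y, z):
    """Carry out of a full adder."""
    return (x & y) | (x & z) | (y & z)

def c2_simplified(a2, a3, a4, u2, u3, u4):
    """Four-term carry equation that relies on value(U) <= 4."""
    return (u2 & a2) | (a2 & u3 & a3) | (a2 & a3 & a4 & u4) | (a2 & u3 & a4 & u4)

def c2_reference(a2, a3, a4, u2, u3, u4):
    """Carry out of bit 2 from an ordinary ripple adder (bit 4 is the low bit)."""
    c4 = a4 & u4
    c3 = majority(a3, u3, c4)
    return majority(a2, u2, c3)

def simplified_carry_is_exact():
    """Compare both forms on every input with value(U) <= 4."""
    for a2, a3, a4, u2, u3, u4 in product((0, 1), repeat=6):
        if 4 * u2 + 2 * u3 + u4 > 4:
            continue  # excluded by the stated bound on U
        if c2_simplified(a2, a3, a4, u2, u3, u4) != c2_reference(a2, a3, a4, u2, u3, u4):
            return False
    return True
```

The simplification works because U_2 = 1 forces U_3 = U_4 = 0 under the bound, so no carry can arrive from the lower bits in that case and the term pairing U_2 with the incoming carry vanishes.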
Equations,  Logic  Level  =  2 


>!    =  a1  c2  v  3]  c2 


b3    =  a3  c4  U3  v  a3  c4  U3  v  a3  c4  U3  v  a3  c4  U3 
BSZQ  =  ASZq^E^ 


BSZ1  =  ASZQ  C2 


BSZ2  =  ASZQ  v  ASZ]  c2  b£ 
BSZ3  =  ASZQ  v  ASZ]  T2 


BSZ4  =  ASZ1  v  ASZ2  c2  b^" 


BSZ5  =  ASZ1  v  ASZ2  T2 


BSZ6  =  ASZ2  v  b2 


BSZ?  =  TRUE 


Cn       ~    d-l   Cn 


SB 


b4  v  a3  c4  U3  v  a3  c4  U3  v  a3  c4  U3  v  a3  c4  U^ 
Equations,  Level  =  3


b0         =     Cl     v     b0 


CSZ.     =     same  as  ASZ.    but  b3,b.  replace  aQ,a.. 


.in 


BSZQ       v 


6 

Z 
k=l 


BSZ™       tt     BSZ^ 

m=0 
m^n 


MCZ 


in  .     .in  . in  DC7in 
0     "     b3     b4     BSZ0         v 


b3     b4 


6 

Z 

k=l 


BSZ?n     TT 
K     m=0 
m^n 


BSZ 


lm 
k-1 


MCZ1/  =     bin  BSZ1"     v    57    i 
J  u  J  k=l 


Dc7in  Dcvlm 

BSZk       tt     BSZk_1 

m=0 

m^n 


MCZ 


1  =     (bin  v  bjn)  BSZin     v     (b/  v  bjn)     Z 

k=l 


BSZ1/     tt     BSZ^1 
m=0 
m^n 


SB'"  BSZ'"     v     SBm     Z 
U  k=l 


BSZ™     tt     BZK™ 
m=0 
ntfn 


MCZ/  =     M™ 
Equations,  Level  =  4


Min      -     Min  rc7in 
M4         =     M«     CSZQ       v 


M?n  CSZ}n    tt     (M™  v  CSZ^"1)     v 

1    m=o     ^  u 


3 


m=0 


m?"  csz:"   n    (m:,m  v  CSZ  '"  )    v 

c     m=0       ^  ' 


M^n 

7r     (M?m  v  CSZJm) 

m=Q       c              *" 

m^n 

=     M™ 

cszjn 

V 

M*n 

cszjn 

tt     MCZ™     V 
m=0        u 
m^n 

4" 

csz™ 

3      - 

7T          f 

m=0 

3    1- 

7T     MCZ™     v 

m=0         ' 

m^n 

^n 

1CZ^m 

We  still   need  to  perform  the  subtraction  of  the  minimum  from  all 
weights. 

Algorithm 

1.  Select the minimum using the $M_4^{in}$.

2.  Do a subtract of minimum.
Equations


i  n 

d.      bit  j  of  minimum 

J 


d«  .  b!"M» 


We  wish  to  do  the  subtraction  in  two  logic  levels 


a.  bit  j  of  result 

c.  carry  from  j       bit 


a4         =     b4  d4       v     b4  d4 


c4        =     b4d4 


c3        =     b3  d3     v     b3  b4  d4 


a3m            in  .  m 

aa.  a.  assuming  ci 

J               J  3  3 

ac.  aln  assuming  cln 


aaQ      =     bQ  ^  b1  tff    v     bQ  ^  b1   b2     v     bQ  ^  b1  d2 


v 


bQ  o^  07  b2     V     bQ  ^  ^  ^ 

acQ   =  bQ  3q  b,  H7  v  bQ  ^  b1  b2  g^  v  bQ  d^  37  b2  d^ 

The above two equations take advantage of the fact that we are
subtracting the minimum and no negative result is possible.




aal   =  bl  dl  b2  v  bl  dl  d2  v  bl  dl  d2  v  bl  dl  b2  v 

B7  b^  d2  v  b1  d1  K,   d2 

ac,   =  b]  dj  b2  d^  v  Fj"  d]  b2  d^  v  b^  ^j"  b^  v 

'  ^7^7  d2  v  bl  dl  h     v  bl  dl  d2 

aa2   =  b2  d^  v  F^  d2 

ac2   =  b2  ^2  V  b2  d2 

a3    =  b3  ^3  ^3  v  ^3  d3  ^3  v  S  ^3  C3  v  b3  d3  c3 

a2    =  aa2  c^  v  ac2  c~ 


a-j    =  aa,  c3  v  ac,  c^ 


aQ    =  aaQ  c3  v  acQ  c3 
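
The final three equations select between two precomputed results for the high-order bits according to the carry from bit 3. The Python sketch below shows the same select structure at the integer level. It assumes 5-bit weights with bits 3 and 4 as the low-order pair, and the function name is hypothetical; it illustrates the technique rather than the gate equations themselves.

```python
def subtract_min_select(b, d):
    """Compute b - d for 5-bit integers with d <= b using a select on the borrow.

    The high three bits are computed twice, once assuming no borrow from
    the low two bits and once assuming a borrow, and the true borrow
    then picks between them.
    """
    low_b, low_d = b & 0b11, d & 0b11
    high_b, high_d = b >> 2, d >> 2
    borrow = 1 if low_b < low_d else 0
    low = (low_b - low_d) & 0b11
    high_no_borrow = (high_b - high_d) & 0b111       # result assuming no borrow
    high_with_borrow = (high_b - high_d - 1) & 0b111  # result assuming a borrow
    high = high_with_borrow if borrow else high_no_borrow
    return (high << 2) | low
```

Computing both candidates in parallel with the borrow keeps the subtraction to two logic levels, at the cost of duplicating the high-order logic.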


Logic  Levels 

4  Increment  and  Select  Min 

1  Switch  Min 

2  Subtract  Min 
7  Total 

Gates 

4976  Increment  and  Select  Min 

160  Switch  Min 

1728  Subtract  Min 
A-4   INCREMENT  AND  DECODER  DETAILS
Inputs 


$a_j^{in}$     finally computed in Section A-3

$U_j^n$

Function

We need to perform the following three functions:

1.  Compute value ($b_j^{in}$) = [value ($a_j^{in}$) - value ($U_j^n$)].

2.  Normalize the result so the smallest is 0, giving $an_j^{in}$.

3.  Compute $B_n^{im}$ = [value ($an_j^{in}$) = $m$].

(i,m)  =  (0,0),  (1,0),  (2,0),  (2,1),  (3,0),  (3,1),  (3,2)

$B_n^{00}$ and $B_n^{10}$ will also be written as $B0_n$ and $B1_n$.

Algorithm 

Taking advantage of value ($U_j^n$) $\le$ 4, we will compute $b_j^{in}$ in
two levels of logic. To do the normalization fast, we will actually
compute $b_j^{in\ell}$ where

[value ($b_j^{in\ell}$)] = [value ($b_j^{in}$) - $\ell$]

$\ell$ = (0,1,2,3,4)

We will detect the smallest $\ell$ for which an overflow occurs and switch
this set $b_j^{in\ell}$ as $an_j^{in}$. We will then generate the $B_n^{im}$
in the obvious way in one clock.
Equations

The equations for computing the $b_j^{in\ell}$ are similar to those for
subtracting the minimum in Section A-3, and we will not describe them in
detail. The switch is also standard, so we will not describe it either.


Gates 


8640    Subtract $U_j^n$ and offsets

 800    Switch correct offset

 168    Compute $B_n^{im}$

9608    Total
A-5   FINAL  SELECTION  UNIT

Inputs 

$B_n^{im}$  generated as described under Increment and Decoder Details.

Functions 

Select  the  minimum  weights  for  up  to  four  instructions,  taking  into 
account  the  fact  that  the  queue  selected  by  instruction  i  must  have  one 
added  to  it  before  determining  the  queue  for  instruction  i+1. 
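
The function just described has a simple sequential reference behavior, sketched below in Python. The hardware produces the same kind of result in parallel; this sketch only states the intended outcome. The function name, the use of integer weights, and the tie-break in favor of the lowest-numbered queue are assumptions of the example.

```python
def assign_queues(weights_per_instruction):
    """Pick a queue for each instruction in order.

    weights_per_instruction[i][n] is the weight of queue n for instruction i.
    Each selection adds one to the chosen queue before the next instruction
    is considered. Ties go to the lowest-numbered queue.
    """
    added = [0] * 4
    choices = []
    for weights in weights_per_instruction:
        effective = [w + a for w, a in zip(weights, added)]
        n = effective.index(min(effective))
        choices.append(n)
        added[n] += 1
    return choices
```

With two instructions that both see weights (0, 0, 5, 5), the first takes queue 0 and the second takes queue 1. With weights (0, 1, 1, 1) for both, the second instruction finds every queue tied after the first selection and again takes queue 0.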

Algorithm 

Output:  $Ri_n$, which indicates instruction $i$ is to use queue $n$.

$R0_n$     is set true for minimum $n$ such that $B0_n$ is true.

$R1_n$     is set true for minimum $n$ such that $\overline{R0_n}\,B1_n$.
If there is no such $n$, then $R1_n$ is set true for the unique $n$ such
that $B1_n$.

$R2_n$     is set depending on the following:

1.  If $\exists n\,[B20_n\,(\overline{R0_n} \lor \overline{R1_n})]$ then the
minimum $n$ satisfying $B20_n\,\overline{R0_n}\,\overline{R1_n}$ is chosen.
If none exists, then minimum $n$ such that $B20_n$ is chosen.

2.  If $\forall n\,[B20_n \rightarrow R0_n\,R1_n]$ then the minimum $n$ such
that $B21_n\,\overline{R0_n}$ is chosen. If none exists, the unique $n$
such that $B20_n$ is chosen.

$R3_n$     is set depending on the following conditions:

1.  $\exists n\,[B30_n\,(\overline{R0_n}\,\overline{R1_n} \lor \overline{R1_n}\,\overline{R2_n} \lor \overline{R0_n}\,\overline{R2_n})]$.
The minimum $n$ such that $B30_n\,\overline{R0_n}\,\overline{R1_n}\,\overline{R2_n}$
is chosen. If none exists, then the minimum $n$ such that $B30_n$ is chosen.

2.  $\forall n\,[B30_n \rightarrow (R0_n\,R1_n\,\overline{R2_n} \lor R0_n\,\overline{R1_n}\,R2_n \lor \overline{R0_n}\,R1_n\,R2_n)]$.
Note both this and the next condition imply $B30_n$ is unique. Also,
$B30_n \rightarrow \overline{(B31_n \lor B32_n)}$. The minimum $n$ such that
$B31_n\,\overline{R0_n}\,\overline{R1_n}\,\overline{R2_n}$ is chosen. If none
exists, then the unique $B30_n$ is chosen.

3.  $\forall n\,[B30_n \rightarrow R0_n\,R1_n\,R2_n]$. The minimum $n$ such
that $B31_n$ is chosen. If none exists, then the minimum $n$ such that
$B32_n$ is chosen. If none exists, the unique $n$ such that $B30_n$ is
chosen.

Equations 

For  a  description  of  the  notation  and  conventions  used  in  generating 
these  equations,  see  Section  4.6.3.2.6. 


0     RO 
n 

Detection: 

Only  one  case. 
Selection: 


$R0_n = B0_n \land \prod_{j=0}^{n-1} \overline{B0_j}$

(level   =  1,  gates  =  10) 
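
The selection of R0 is a priority selection: the output is true only at the first position where B0 is true. A minimal Python sketch, with a hypothetical function name and Python lists standing in for the four signals, is:

```python
def first_true_one_hot(b0):
    """Return a list that is True only at the lowest index where b0 is true."""
    out, seen = [], False
    for b in b0:
        out.append(bool(b) and not seen)  # true here and at no earlier index
        seen = seen or bool(b)
    return out
```

For input `[0, 1, 1, 0]` only position 1 is selected. If no input is true, no output is true.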


1 
1.1 


3i[Bl     A    7r     Bl.] 
n      j=0      J 
j7n 

i.e.,  only  one  Bln  is  true. 


Detection  and  selection: 
3  

Bl         7T   Bl. 

j=0     J 
j7n 


(gates  =  20) 
n
1.2  3n[Bl     A     Z     Bl.] 

"       j=0       J 

i.e.,  more  than  one  Bl     is  true. 

n 

Detection: 

There  is  no  need  to  detect  this  case.  We  always  choose  a  true 

Bln-  Since  in  case  1.1  Bl  is  unique,  we  cannot  make  a  selection 

in  conflict  with  case  1.1. 

Selection: 

n-1 

1.2.1     3nEB1n  B0   *  B'H 

Detection  and  Selection: 

Clearly,  we  can  select  the  unique  Bl  ,  satisfying  the  above. 
Thus,  we  have 
n-1 


Bl  BO   77  Bl.  (Gates  =  18) 

n   n  j=Q   j 

n-1 

1.2.2     Vn[Bl  v  BO  v  Z  Bl.J 
n    n   j=0   J 

i.e.,  the  first  true  Bl  occurs  with  a  true  BO  . 

n  n 

Detection  and  Selection: 

In  this  case  we  must  assure  ourselves  that  there  is  a  smaller 

n  with  B0n  true  before  we  can  select  this.  To  insure  that  the 

selection  will  be  unique,  we  must  choose  the  first  Bl  with  a 

n 

smaller  BO 
n 

n-1      n-1 
(n  =  1,2,3)  Bl  [(  Z  BO.)  (77  Bl  .)  v 
n    j=0   J   j=0   J 

n-1        j-1  

Z  (Bl.  BO.  7T  BO.  )] 
j=0    J   J  k=0   J 
Gates  =

8 

n=l 

15 

n=2 

2-3 

n=3 

46 

Total 

Summary  for  Rl 

R1n       = 

B1n 

3 

TT     Bl .      V 
j=0      J 
#1 

B1n 

n-1 

B0M     TT     Bl  . 

n  j=o     J 

B1n 

n-1 

(  Z     BO.)    ( 
j=0       J 

n-1  

TT  Bl.  )  v 
j=0   J 

n-1        j-1  

Bl  z     (Bl.  BO.  TT  BO.  ) 
n  j=0    J   J  k=0   J 

(note  for  the  last  two  terms,  n  =  1,2,3) 
(Gates  =  84,  Level  =  1) 
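
The rule that the R1 equations implement can be stated behaviorally: prefer the first queue that B1 marks and that instruction 0 did not take, and otherwise fall back to a queue that B1 marks. The Python sketch below follows that statement of the rule from the Algorithm. The function name is hypothetical and the gate-level case analysis is not reproduced.

```python
def select_r1(r0, b1):
    """One-hot choice for instruction 1 given R0 (one-hot) and B1.

    Prefers the lowest n with B1[n] true and R0[n] false. If no such n
    exists, B1 can only be true where R0 is, and that n is chosen.
    """
    size = len(b1)
    for n in range(size):
        if b1[n] and not r0[n]:
            return [k == n for k in range(size)]
    for n in range(size):
        if b1[n]:
            return [k == n for k in range(size)]
    return [False] * size
```

With R0 = (1, 0, 0, 0) and B1 = (1, 0, 1, 0), queue 2 is chosen. With B1 = (1, 0, 0, 0) the only candidate is queue 0, which is chosen even though instruction 0 also took it.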


2    R2 


"n 


2.1       Vn[ROn  =  RlnJ 


Detection 


D2.1  =  I   RO  Rl  (Gates  =  12,  Level  =  2) 


i=0  n   n 


Selection: 
2.2.1     Vi[B20n  ■*  ROn] 
Detection


D2.1-1  =   7T  (B20n  v  R0n)     (Gates  =  20,  Level  =  2) 


This  gate  count  takes  advantage  of  R0  RO .  =  n=j 


Selection: 
2.1.1.1    3n[B21n] 


Detection: 

3 
D2. 1.1.1  =   z  B21  (Gates  =  4,  Level  =  1) 

n=0    n 


Selection: 

n-1 


S2. 1.1.1     =     B2.1         7T     B2.1.  (Gates  =  40,  Level   =  1) 

n  n     j=0  J 


2.1.1.2         V  [B21   ] 
nL       nJ 


Detection: 


D2. 1.1.1 


Selection 
B20. 


n 


2.1-2  3n[B20M  R0~] 

n      n 


Detection: 


D2.1.1 
Selection: 


n-1 


S.2.1.2n     =     B20n     7T     (B20.     v     R0n) 

(Gates  =  25,  Level  =  2) 


"n  •  n   .=0  v       j  n' 




This  gate  count  takes  advantage  of  RO     RO.     =     n=j 

J 


2.2     3n[R0n  Rln] 


Detection: 


D2.1 


Selection: 

2.2-1  Vn[B20    *  (RO     v  Rl    )1 

L       n       v     n  n/J 


Detection: 


D2.2.2 


Selection: 


First  get  B20     f  RO 
3  n  n 


n-1 


TS2.2.2n  =  B20jB0n  v  B0„     tt     BO  J 

(Gates  =  26,  Level  =  1) 


•n  nL     n  n  j=Q     wj 


n-1 


S2.2.2n     =  TS2.2.1n  Rln     tt     (TS2.2.1  .     v  Rl  .) 

(Gates  =  30,  Level  =  2) 

Selection: 

n-1 


S2.2-1       =  B20„     tt     B20. 
n  n  j=0        J 


2.2.2  3n[B20     RO     Rl   J 

n      n      nJ 
Detection:


D2.2-2  =  Z  B20  RO  Rl 
n=0    n   n   n 


Summary  2 


R2n  =  D2.1  D2. 1.1  D2. 1.1.1  S2. 1.1.1   v 


D2.1  D2.1.1  D2. 1.1.1  B20   v 

n 


D2.1  D2.1.1  S2.1.2   v 
n 


D2.1  D2.2.2  S2.2.1   v 
n 

D27T  D2.2.2  S2.2.2 
n 

(Gates  =  22,  Level  =  3) 


3.1  Vn(ROn  Rln  v  Rln  R2n  v  R0n  Rln) 

i.e.,  for  each  of  RO,  Rl ,  R2,  they  are  true  for  different  values  of 
n.  No  two  of  them  are  true  for  the  same  n. 

Detection: 

D27T  means  RO  f   Rl 

T3,1n  =  ROn  v  R1n  (Gates  =  8»  Level  =  2) 


D3.1  =  D2.1  I   (T3.1  R20.)      (Gates  =  16,  Level  =  2) 
n=0    n    n 


Selection: 

3.1.1     Vn(B30„  -►  R0n  v  Rl  v  R2  ) 
n    n    n    n 


Detection: 


D3.1.1  =  I  B30  RO  Rl  R2n    (Gates  =  20,  Level  =  4) 


n=0   n   n   n   n 
Selection


Choose  first  B30n  which  is  true. 
n-1 

j=o 


S3.1.1n  =  B30n  _t7rrt  B30j  (Gates  =  10,  Level   =  1) 


3.1.2  3n(B3°n^^^n~) 


Detection: 
D3.1.1 


Selection: 

We  must  choose  first  B30n  not  equal   RO     or  Rl     or  R2  .     First 

n  n  n  n 

we  get  the  B30n  not  equal   RO     and  Rl    . 

TS3.1.2n     =     B30nRTRr  (Gates  =  12,  Level   =  2) 

n-1 
S3.1.2n       =     TS3.1.2n     tt       CTS2.1.2,  v  R20.) 

(Gates  =  25,  Level   =  4) 
3.2  3n(R0n  Rln  v  Rln  R2n  v  R0n  Rln) 

Two  or  more  of  RO,  Rl ,  R2  agree. 
Detection: 


D3.1 
Selection : 


3.2.1  3n  R0„  Rl      R2 

n      n       n 
Detection:


D3.2.1  =   E  R()  Rl  R2 
n=0   n   n   n 


(Gates  =  16,  Level  =  4) 


Selection: 


3.2.1.1   3n(B30n  v  R0n) 


Detection: 


D3.2.1.1  =  Z  B30  RO 
n=0   n   n 


(Gates  =  12,  Level  =  2) 


Selection: 


S3. 2. 1.1  =  R0n  B30n  7T  (RO  v  B30  ) 
n    n   n  m=Q        m     m' 


3.2.1.2    Vn(B30  +   RO  ) 
n    n 


Detection 


(Gates  =  96,  Level  =  2) 


D3.2.1.2 


Selection 


3.2.1.2-1    Vn  B31 


Detection: 


D3. 2. 1.2-1  =  77  B31 
n=0 


(Gates  =  4,  Level  =  1) 


Selection: 

3.2.1.2-1.1   3n  B32 
Detection:

3 
D3. 2. 1.2-1.1  =     z     B32n  (Gates  =  4,  Level   =  1) 

n=0        " 

Selection:  ' 

n-1 
S3. 2. 1.2-1.1       =     B32„  A     tt         B32 


n  n  n  n 

m=0 


(Gates  =  40,  Level  =  1) 


3.2.1.2.1.2       Vn  B3T" 

n 


Detection: 


D3. 2. 1.2-1.1 


Selection 
R30 


i 


3.2.1.2-2        3n  B31 

n 


Detection: 


D3. 2. 1.2-2 

Selection: 

n-1 


B31       tt       B31 
n  m=0  n 


Summary  for  3.2.1.2 

S3. 2. 1.2       =     D3. 2. 1.2-1  A  D3. 2. 1.2-1.1  A  S3. 2.1. 2-1.1        v 
"  n 


D3. 2. 1.2-1   A  D3. 2. 1.2-1.1  A  R30.     v 

_____  ""I 

D3. 2. 1.2-2  B31       tt       B31~ 


m=0 


(Gates  =  46,  Level  =  4) 
3.2.2  Vn  (ROn  v  RTrT]


Detection 


D3.2.1 
Selection: 


3.2.2.1         3n(B30     RO     FT  v  B30     Rl      R2     v  B30     RO     R2  ) 
nnn  nnn  nnn 

B30     is  true  for  a  case  when  at  most  one  of  RO,  Rl ,  R2  is  true. 

Detection: 

First  we  get  all  B30  unequal  to  R30  and  R31  . 

■       n    ^        n       n 


TA3.2.2.2.1  =  B30n  R0„  Rl       (Gates  =  12,  Level  =  2) 
n     n   n   n  ' 


In  addition,  we  need: 


TB3.2.2.1    =  B30  RO   v  B30  Rl 
n       n   n       n   n 


3  

D3.2.2.1  =  l     T3.2.2.1  R2 
n=0        n   n 


(Gates  =  24,  Level  =  2) 


(Gates  =  12,  Level  =  4) 


Selection: 


3.2.2.1.1   3n(B30M  R0„  Rl  R2  ) 
n   n   n   n 


Detection: 


D3.2.2.1.1  =  I   B30  RO  Rl  R2 
n=0   n   n   n   n 


(Gates  =  20,  Level  =  4) 
Selection:


First  we  set  all  B30M  not  equal  to  either  RO  or  Rl  . 

n  n     n 


T3.2.2.1.1   =  B30  RO  Rl 
n      n   n   n 


S3. 2. 2. 1.1   =  TA3.2.2.1.1   v 
n  n 


(Gates  =  12,  Level  =  2) 


n-1 


n  m=0  n 


TB3.2.2.1.1„  7T  TB3.2.2.1.1 

(Gates  =  23,  Level  =  4) 


3.2.2.1.2    Vn  (B30„  v  RO  v  Rl  v  R2~) 

n    n    n    n 


Detection 


D3.2.2TTTT 


Selection: 

From  case  3.2.2.1  we  know  B30n  is  true  when  only  one  RO,  Rl , 
or  R2  is  true.  From  case  3.2.2  we  know  RO,  Rl ,  R2  agree  for 
one  value  of  n.  Since  each  is  true  for  exactly  one  n,  there 
can  only  be  a  single  n  for  which  those  conditions  and  the 
above  condition  hold.  We  will  use  previously  generated  terms 

S3.2.2.1.2n  =  TA3.2.2.1   v  TB3.2.2.1  R2~ 
n  n  n   n 

(Gates  =  16,  Level  =  4) 
3.2.2.2       Vn   (B30     ■>  ROn  Rl       v     Rl     R2n     v  ROn  Rl)
n  n      n  n       n  n      n 


Detection 


D3.2.2.1 
Selection: 


3.2.2.2.1     3n   (B31    ) 
n 


Detection: 


3 
D3.2.2.2.1     =       Z     B31  (Gates  =  4,  Level  =  1) 


n=0         n 


Selection: 


3.2.2.2.1.1       3n  B31     RO     Rl     R2 

n       n      n       n 


Detection: 


3 
D3. 2. 2. 2. 1.1         Z     B31     W~  RT~  R7~ 
n=0         "       n       n       n 


(Gates  =  20,  Level   =  4) 


Selection: 


T3. 2. 2. 2. 1.1     =  RO     Rl     B31 
n  n      n        n 


S3. 2. 2. 2. 1.1     =  R2M  T3. 2. 2. 2. 1.1       A 
n          n  n 

n-1  

77  (R2n  v  T3. 2. 2. 2. 1.1    ) 


m=0        n 


(Gates  =  48,  Level   =  4) 
3.2.2.2.1.2   $\forall n\,[B31_n \rightarrow (R0_n \lor R1_n \lor R2_n)]$

n     n    n    n/J 


Detection: 


D3. 2. 2. 2. 1.1 


Selection 


$B31_n$ satisfying the above must be unique. This is true because
from 3.2.2.2 we know R0, R1, R2 agree for some value of n. Thus, there
is only one value of n for which exactly one of them is true. $B30_n$ is
true for the value where two of R0, R1, R2 agree.
$B30_n \rightarrow \overline{B31_n}$. Thus, there is a unique n for which
$B31_n$ is true and exactly one of R0, R1, R2 is true.


TA3.2.2.2.1.2 


TB3.2.2.2.1.2 


B31n  R0n  RT 


(Gates  =  12,  Level  =  2) 


S3. 2. 2. 2. 1.2 


n 


B31  R0n  v  B31   Rl 
n   n       n   n 

(Gates  =  24,  Level  =  2) 

TA3.2.2.2.1.2„  v  TB3.2.2.2. 1 .2 
n  n 

(Gates  =  8,  Level  =  4) 


3.2.2.2.2   Vn  B31 


Detection 


3.2.2.2.1 


Selection 


There  is  a  unique  B30n  true  which  agrees  with  two  of  RO,  Rl,  R2. 
Select  this  n. 


B30 




Summary  for  R3 
J  n 


R3n       =     D3.1   A  D3.1.1  A  S3. 1.1     v 


D3.1   A  D3.1.1   A  S3.1.2     v 


D3.1   A  D3.1.1  A  D3.2.1.1       S3. 2. 1.1       v 

n 


D3.1   A  D3.1.1  A  D3.2.1.2  A  S3. 2. 1.2       v 

n 


D3.1  A  D3.2.1  A  D3.2.2.1  A  D3.2.2.1.1  A  S3. 2. 2. 1.1       v 

n 


D3.1  A  D3.2.1   A  D3.2.2.1  A  D3.2.2.1.1       S3. 2. 2. 1.2       v 

n 


D37T  A  D3.2.1  A  D3.2.2.1  A  D3.2.2.2.1  A  D3. 2. 2. 2. 1.1  A 


S3. 2. 2. 2. 1.1       v 
n 


D37T  A  D3.2.1  A  D3.2.2.1  A  D3.2.2.2.1       D3. 2. 2. 2. 1.1  A 


S3. 2. 2. 2. 1.2       v 
n 


D3. 1  A  D3.2.1  A  D3.2.2.1  A  D3.2.2.2.1  A  B30 

n 


(Gates  =  48,  Level  =  5) 